import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import displayIntroduction
Chronic absenteeism—defined here as missing more than 10% of enrolled days—can have a profound impact on a student’s academic performance and future prospects. Frequent absences correlate with lower grades, increased dropout rates, and diminished long-term outcomes. By understanding which factors drive absenteeism, educators and policymakers can design more targeted interventions.
In this notebook, I analyze records from seven schools over seven academic years. My overarching goal is to identify patterns and predictors of chronic absenteeism, with the ultimate aim of guiding effective attendance-improvement strategies.
Motivation
Student attendance is more than just a line item in school administrative data. It’s linked to:
- Academic Achievement: Students who attend school regularly often see better test scores and improved GPAs.
- Equity Considerations: Certain demographic groups (e.g., English Learners, low-income students) might face unique challenges that affect attendance.
- Resource Allocation: Understanding absenteeism helps schools allocate counselors, attendance monitors, and program funding more effectively.
- Long-Term Impact: Chronic absenteeism in earlier grades can ripple through a student’s educational trajectory, potentially influencing high school completion and college readiness.
By examining variables such as attendance rates, days enrolled, demographic details, and special program status, I seek to paint a clearer picture of how these features interrelate and what they reveal about absenteeism trends.
Dataset Description
The dataset in this study is collected from seven schools in the Oakland Unified School District over seven years, from 2012 to 2024. The seven schools are:
- Castlemont
- East Oakland PRIDE
- Elmhurst United
- MLK
- McClymonds
- Prescott
- WOMS
Each file spans multiple academic years, resulting in a total of over 20,000 student-year observations. The key columns include:
- ANON_ID: A unique, anonymized student identifier.
- Birthdate: Date of birth, from which I derived each student’s approximate age.
- Gen: Student’s gender.
- Eth: Ethnicity (e.g., Latino, African American, White, etc.).
- Fluency: English proficiency classification (e.g., EO, EL, RFEP, IFEP).
- SpEd: Special Education status (e.g., “Special Ed” or “Not Special Ed”).
- Grade: Student’s grade for that academic year (-1 through 12 in this dataset).
- AttRate: Attendance rate (1 – [Days Absent / Days Enrolled]).
- DaysEnr: Number of days enrolled in that school year.
- DaysAbs: Number of days absent in that school year.
- Susp: Number of suspensions.
- CurrWeightedTotGPA: Weighted GPA, if applicable.
- SED: Socio-economically Disadvantaged status.
- School: Name of the school.
- Year: Academic year label (e.g., “17-18”, “18-19”).
- Age: Computed from birth year and the school year.
- ChroAbs: Binary variable (1 if absent more than 10% of enrolled days, 0 otherwise).
1. Dataset Overview
# Load 7 schools dataset
castlemont = pd.read_csv('data/Castlemont.csv')
eof = pd.read_csv('data/East Oakland PRIDE.csv')
elmhurst_united = pd.read_csv('data/Elmhurst United.csv')
mlk = pd.read_csv('data/MLK.csv')
mcClymonds = pd.read_csv('data/McClymonds.csv')
prescott = pd.read_csv('data/Prescott.csv')
woms = pd.read_csv('data/ONGB -Lan/WOMS.csv')
# castlemont.head()# Concatenate into one dataframe
schools = pd.concat([castlemont, eof, elmhurst_united, mlk, mcClymonds, prescott, woms]).reset_index(drop=True)
schools.head(5)| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 338 | 2003-07-21 | F | Latino | RFEP | Not Special Ed | 9 | 0.9889 | 180.0 | 2.0 | NaN | 3.65 | Unknown | Castlemont | 17-18 |
| 1 | 340 | 2003-03-31 | M | Latino | EL | Not Special Ed | 9 | 0.8389 | 180.0 | 29.0 | 2.0 | 0.06 | Unknown | Castlemont | 17-18 |
| 2 | 478 | 2003-09-06 | F | Multiple Ethnicity | EO | Special Ed | 9 | 0.7263 | 179.0 | 49.0 | 1.0 | 0.24 | Unknown | Castlemont | 17-18 |
| 3 | 686 | 2000-04-02 | M | Latino | EL | Not Special Ed | 12 | 0.9611 | 180.0 | 7.0 | NaN | 2.00 | Unknown | Castlemont | 17-18 |
| 4 | 693 | 2002-03-28 | F | Latino | RFEP | Not Special Ed | 10 | 0.9889 | 180.0 | 2.0 | NaN | 2.59 | Unknown | Castlemont | 17-18 |
# Add age (school year - birthdate) column
def calculate_age(df):
df['Birthdate'] = pd.to_datetime(df['Birthdate'])
end_year = df['Year'].str.split('-').str[0].astype(int)
year = end_year.apply(lambda x: 2000 + x)
df['Age'] = year - df['Birthdate'].dt.year
return df
schools = calculate_age(schools)
schools.head(5)| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 338 | 2003-07-21 | F | Latino | RFEP | Not Special Ed | 9 | 0.9889 | 180.0 | 2.0 | NaN | 3.65 | Unknown | Castlemont | 17-18 | 14 |
| 1 | 340 | 2003-03-31 | M | Latino | EL | Not Special Ed | 9 | 0.8389 | 180.0 | 29.0 | 2.0 | 0.06 | Unknown | Castlemont | 17-18 | 14 |
| 2 | 478 | 2003-09-06 | F | Multiple Ethnicity | EO | Special Ed | 9 | 0.7263 | 179.0 | 49.0 | 1.0 | 0.24 | Unknown | Castlemont | 17-18 | 14 |
| 3 | 686 | 2000-04-02 | M | Latino | EL | Not Special Ed | 12 | 0.9611 | 180.0 | 7.0 | NaN | 2.00 | Unknown | Castlemont | 17-18 | 17 |
| 4 | 693 | 2002-03-28 | F | Latino | RFEP | Not Special Ed | 10 | 0.9889 | 180.0 | 2.0 | NaN | 2.59 | Unknown | Castlemont | 17-18 | 15 |
# Add ChroAbs column with 1 is student who absent for more than 10% of enrolled days
abs_per = schools['DaysAbs'] / schools['DaysEnr']
schools['ChroAbs'] = [1 if x > 0.1 else 0 for x in abs_per]
schools.tail(5)| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | Age | ChroAbs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21989 | 77312 | 2009-10-26 | M | African American | EO | Not Special Ed | 8 | 0.9611 | 180.0 | 7.0 | NaN | 2.39 | SED | WOMS | 23-24 | 14 | 0 |
| 21990 | 77803 | 2011-06-30 | M | Latino | EO | Special Ed | 7 | 0.6222 | 45.0 | 17.0 | NaN | 0.00 | SED | WOMS | 23-24 | 12 | 1 |
| 21991 | 78129 | 2011-10-09 | F | Latino | RFEP | Not Special Ed | 6 | 0.9441 | 179.0 | 10.0 | NaN | 3.38 | Not SED | WOMS | 23-24 | 12 | 0 |
| 21992 | 78498 | 2011-06-21 | M | African American | EO | Not Special Ed | 7 | 0.9333 | 180.0 | 12.0 | NaN | 3.66 | SED | WOMS | 23-24 | 12 | 0 |
| 21993 | 79427 | 2012-01-10 | M | Latino | RFEP | Not Special Ed | 6 | 0.7375 | 80.0 | 21.0 | NaN | 3.37 | SED | WOMS | 23-24 | 11 | 1 |
Initially, I observed that Attendance Rate and Days Absent are critical measures as they are directly linked to Chronic Absenteeism. GPA (CurrWeightedTotGPA) variable is important as I assume that lower academic performance is typically associated with higher absenteeism. Demographic factors such as Gender and Ethnicity provide context for understanding potential disparities in academic performance as well as absenteeism patterns. Fluency is another significant variable, as language barriers often correlate with performance challenges, especially for non-native speakers. The SpEd reports whether students have special educational needs, and SED indicates economic disadvantage, both of which can substantially impact educational engagement. Economic hardship, for example, has been linked to lower attendance rates and higher absenteeism.
Furthermore, the dataset have missing or inconsistent values that need to be addressed. DaysEnr variable has unusually low numbers, which could make absenteeism rates appear higher than they are. Susp has the highest number of missing values (20,556 records). There are 6,441 missing values in CurrWeightedTotGPA, accounting for a large portion of the dataset. Each of the AttRate, DaysEnr, and DaysAbs has 29 missing entries. In terms of inconsistent values, SED column has 6,317 entries labeled as Unknown. In the Fluency column, there are 100 entries marked as TBD and 6 as Unknown. These missing and inconsistent values will be addressed further
from IPython.display import display, HTML
# Generate HTML for each column's value counts
html_output = "<div style='display: flex; flex-wrap: wrap;'>"
for col in cat_cols:
# Get the value counts for the column
counts = schools[col].value_counts()
# Create an HTML table for the value counts of each column
html_output += f"""
<div style='margin-right: 20px; padding: 10px; border: 1px solid #ddd;'>
<h4>{col}</h4>
{counts.to_frame().to_html(header=False)}
</div>
"""
html_output += "</div>"
# Display all counts side by side in HTML format
display(HTML(html_output))Gen
| Gen | |
|---|---|
| M | 11857 |
| F | 10113 |
| N | 24 |
Eth
| Eth | |
|---|---|
| Latino | 10645 |
| African American | 8335 |
| White | 840 |
| Asian | 753 |
| Not Reported | 526 |
| Multiple Ethnicity | 424 |
| Pacific Islander | 358 |
| Native American | 63 |
| Filipino | 50 |
Fluency
| Fluency | |
|---|---|
| EO | 10653 |
| EL | 8281 |
| RFEP | 2688 |
| IFEP | 266 |
| TBD | 100 |
| Unknown | 6 |
SpEd
| SpEd | |
|---|---|
| Not Special Ed | 18144 |
| Special Ed | 3850 |
Grade
| Grade | |
|---|---|
| 9 | 2623 |
| 10 | 2540 |
| 7 | 2189 |
| 6 | 2182 |
| 8 | 2168 |
| 11 | 2148 |
| 12 | 1711 |
| 1 | 1089 |
| 3 | 1065 |
| 2 | 1036 |
| 4 | 1015 |
| 0 | 980 |
| 5 | 940 |
| -1 | 308 |
SED
| SED | |
|---|---|
| SED | 14363 |
| Unknown | 6317 |
| Not SED | 1314 |
Age
| Age | |
|---|---|
| 16 | 2267 |
| 13 | 2253 |
| 12 | 2236 |
| 17 | 2176 |
| 15 | 2142 |
| 14 | 2134 |
| 11 | 1785 |
| 7 | 1036 |
| 8 | 1033 |
| 6 | 1020 |
| 9 | 1017 |
| 10 | 981 |
| 5 | 814 |
| 18 | 798 |
| 19 | 161 |
| 4 | 96 |
| 20 | 41 |
| 21 | 4 |
Year
| Year | |
|---|---|
| 19-20 | 3375 |
| 20-21 | 3197 |
| 21-22 | 3169 |
| 23-24 | 3124 |
| 22-23 | 3113 |
| 18-19 | 3034 |
| 17-18 | 2982 |
School
| School | |
|---|---|
| Castlemont | 6456 |
| Elmhurst United | 4989 |
| MLK | 2720 |
| East Oakland PRIDE | 2634 |
| McClymonds | 2566 |
| WOMS | 1550 |
| Prescott | 1079 |
ChroAbs
| ChroAbs | |
|---|---|
| 0 | 11882 |
| 1 | 10112 |
# Info for all schools over past 7 years
schools.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21994 entries, 0 to 21993
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ANON_ID 21994 non-null int64
1 Birthdate 21994 non-null datetime64[ns]
2 Gen 21994 non-null object
3 Eth 21994 non-null object
4 Fluency 21994 non-null object
5 SpEd 21994 non-null object
6 Grade 21994 non-null int64
7 AttRate 21965 non-null float64
8 DaysEnr 21965 non-null float64
9 DaysAbs 21965 non-null float64
10 Susp 1438 non-null float64
11 CurrWeightedTotGPA 15553 non-null float64
12 SED 21994 non-null object
13 School 21994 non-null object
14 Year 21994 non-null object
15 Age 21994 non-null int64
16 ChroAbs 21994 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(4), object(7)
memory usage: 2.9+ MB
# Missing values total
schools.isna().sum()| 0 | |
|---|---|
| ANON_ID | 0 |
| Birthdate | 0 |
| Gen | 0 |
| Eth | 0 |
| Fluency | 0 |
| SpEd | 0 |
| Grade | 0 |
| AttRate | 29 |
| DaysEnr | 29 |
| DaysAbs | 29 |
| Susp | 20556 |
| CurrWeightedTotGPA | 6441 |
| SED | 0 |
| School | 0 |
| Year | 0 |
| Age | 0 |
| ChroAbs | 0 |
# Missing values percentage
schools.isna().mean()| 0 | |
|---|---|
| ANON_ID | 0.000000 |
| Birthdate | 0.000000 |
| Gen | 0.000000 |
| Eth | 0.000000 |
| Fluency | 0.000000 |
| SpEd | 0.000000 |
| Grade | 0.000000 |
| AttRate | 0.001319 |
| DaysEnr | 0.001319 |
| DaysAbs | 0.001319 |
| Susp | 0.934619 |
| CurrWeightedTotGPA | 0.292853 |
| SED | 0.000000 |
| School | 0.000000 |
| Year | 0.000000 |
| Age | 0.000000 |
| ChroAbs | 0.000000 |
2. Exploratory Data Analysis
2.a Numerical Distributions: Attendance Rate, Days Enrolled, Days Absent
def plot_distribution_and_summary(df, col, title=None):
"""
Plots the distribution histogram and statistical summary for a specified column.
Parameters:
- df: DataFrame containing the data
- col: str, the name of the column to plot and summarize
- title: modified title for the plot
"""
# Calculate the statistical summary for the specified column
summary_stats = df[col].describe()
# Set up a figure with 2 subplots: 1 for the plot, 1 for the text
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), gridspec_kw={'width_ratios': [3, 1]})
# Plot the histogram for the specified column
if title:
ax1.set_title(title, fontweight='bold', color='darkblue')
else:
ax1.set_title(f'{col} Overall Distribution', fontweight='bold', color='darkblue')
ax1.set_xlabel(f'{col}', fontsize=16)
ax1.set_ylabel('Frequency', fontsize=16)
sns.histplot(data=df, x=col, ax=ax1)
ax1.grid(True, linestyle='--', alpha=0.5)
# Display the statistical summary on the right subplot
ax2.axis('off') # Turn off the axis for the text display
summary_text = summary_stats.to_string()
ax2.text(0, 0.5, summary_text, fontsize=12, va='center', ha='left', fontweight='bold', color='navy')
# Adjust layout and show the plot
plt.tight_layout()
plt.show()
# Attendance Overall Distribution
plot_distribution_and_summary(schools, 'AttRate')
def plot_boxplot_with_counts(df, numerical_col, categorical_col):
"""
Plots a boxplot for a numerical column grouped by a categorical column
along with the count of each category.
Parameters:
- df: DataFrame containing the data
- numerical_col: str, the name of the numerical column to plot
- categorical_col: str, the name of the categorical column to group by
"""
# Define the category order based on the boxplot
category_order = df[categorical_col].value_counts().index.tolist()
# Set up the figure with two subplots: one for the boxplot and one for the bar plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), gridspec_kw={'width_ratios': [3, 1]})
# Plot the boxplot for the numerical column by the categorical column
sns.boxplot(data=df, y=categorical_col, x=numerical_col, ax=ax1, hue =categorical_col, palette="vlag", order=category_order)
ax1.set_title(f'Boxplot of {numerical_col} by {categorical_col}', fontweight='bold', color='darkblue')
ax1.set_xlabel(categorical_col, fontweight='bold')
ax1.set_ylabel(numerical_col, fontweight='bold')
ax1.grid(True, linestyle='--', alpha=0.5)
# Plot the count of each category as a bar plot
sns.countplot(data=df, y=categorical_col, order=category_order, ax=ax2, hue=categorical_col, palette='vlag')
ax2.set_title(f'Counts of Each {categorical_col} Category', fontweight='bold', color='darkblue')
ax2.set_xlabel(categorical_col, fontweight='bold')
ax2.set_ylabel('Count', fontweight='bold')
ax2.grid(True, linestyle='--', alpha=0.5)
# Adjust layout and display the plots
plt.tight_layout()
plt.show()
# Attendance Distributions by School
plot_boxplot_with_counts(schools, 'AttRate', 'School')
Observation: The boxplot shows the distribution of attendance rates across different schools. Elmhurst United displays a higher median attendance rate while Castlemont has the lowest median attendance rate with largest variability. Note that Castlemont also has the highest number of data among other schools.
# Days Absent Overall Distribution
plot_distribution_and_summary(schools, 'DaysAbs')
# Days Absent Distribution by School
plot_boxplot_with_counts(schools, 'DaysAbs', 'School')
Observation: The distribution of Days Absent closely mirrors that of Attendance Rate, indicating a strong dependency between the two. We can consider dropping the DaysAbs column later to avoid redundancy.
# Days Enrolled Overall Distribution
plot_distribution_and_summary(schools, 'DaysEnr')
# Days Enrolled Distribution by School
plot_boxplot_with_counts(schools, 'DaysEnr', 'School')
Observation: The Days Enrolled column contains numerous unusually low values. This raises questions about the underlying causes (further research on how schools report days enrolled may be necessary). We may also need to reconsider the definition of chronic absenteeism (where the ratio of days absent to days enrolled is greater than 10%), as low values for days enrolled could disproportionately impact the chronic absenteeism rate.
Key findings including:
- Most students have high attendance rates, though certain schools show significant absenteeism challenges.
- Variability in absenteeism and enrollment suggests that some schools might face systemic issues with student attendance and retention.
- The distribution of enrolled days highlights potential inconsistencies in data reporting, which could affect the accuracy of absenteeism measurements.
2b. Time Series over Year: Attendance Rate and Chronic Absenteeism Percentage
# Time series plot for attendate rate, aggregating by median
plt.figure(figsize=(10, 5))
plt.title('Attentance Rate Across Schools Over 7 Years', fontweight='bold', color='darkblue')
sns.lineplot(data=schools, x='Year', y='AttRate', hue='School', estimator='median', errorbar=None, marker='o', linewidth=0.7)
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1));
Dive deeper into MLK and Castlemont Attendance Rate Distribution
# Examine attendance rate at Castlemont, espcially year 20-21
plot_distribution_and_summary(schools[(schools['School'] == 'Castlemont') & (schools['Year'] == '20-21')], 'AttRate', "Attendance Rate at Castlemont in 20-21")
# Examine attendance rate at MLK, espcially year 22-23
plot_distribution_and_summary(schools[(schools['School'] == 'MLK') & (schools['Year'] == '22-23')], 'AttRate', "Attendance Rate at MLK in 22-23")
# Time series plot for Days Enrolled, aggregating by median
plt.figure(figsize=(10, 5))
plt.title('Days Enrolled Across Schools Over 7 Years', fontweight='bold', color='darkblue')
sns.lineplot(data=schools, x='Year', y='DaysEnr', hue='School', estimator='median', errorbar=None, marker='o', linewidth=0.7)
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.grid(True, linestyle='--', alpha=0.5);
# ChronicAbsent Percentage by Year
plt.figure(figsize=(10, 5))
plt.title('Chronic Absenteeism Percentage Across Schools Over 7 Years', fontweight='bold', color='darkblue')
sns.lineplot(data=schools, x='Year', y='ChroAbs', hue='School', estimator='mean', errorbar=None, marker='o', linewidth=0.7)
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.grid(True, linestyle='--', alpha=0.5);
Quick observation: Even though 19-20 has lowest median days enrolled but the percentage of chronically absent student are under 40%. School years 21-22 and 22-23 have high chronic absenteeism percentage.
Key Findings:
- Attendance Trends: Attendance rates fluctuated over the seven-year period, with notable declines during the 2018-19 and 2021-22 school years. These dips, particularly during the pandemic years, indicate district-wide challenges in maintaining student engagement.
- Days Enrolled: Enrollment days remained stable at around 180 for most years, except for a sharp drop in 2019-20, indicating potential mid-year closures or interruptions.
- Castlemont-Specific Challenges: Castlemont consistently displayed lower attendance rates and higher chronic absenteeism, especially in 2020-21, signaling a need for targeted interventions. The school’s unique struggles may be due to underlying socio-economic conditions, resource gaps, or other external factors.
2c. Demographic Factors: Gender, Ethnicity, Fluency, Special Education and Socio-economically Disadvantaged Status
def plot_bar(df, col, title=None):
"""
Plots a bar chart for a specified column.
Parameters:
- df: DataFrame containing the data
- col: str, the name of the column to plot
- title: modified title for the plot
"""
plt.figure(figsize=(10, 5))
if title:
plt.title(title, fontweight='bold', color='darkblue')
else:
plt.title(f'Counts of {col} per School over 7 Years', fontweight='bold', color='darkblue')
sns.countplot(data=df, y='School', hue=col, palette='Paired')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.show()
# Bar plot of Gender per School
plot_bar(schools, 'Gen', "Counts of Gender per School over 7 Years")
# Bar plot of Ethnicity per School
plot_bar(schools, 'Eth', "Counts of Ethnicity per School over 7 Years")
# Bar plot of Fluency per School
plot_bar(schools, 'Fluency')
# Bar plot of Special Education per School
plot_bar(schools, 'SpEd', "Counts of Special Education Status per School over 7 Years")
# Bar plot of Socio-Economically Disadvantaged Counts per school
plot_bar(schools, 'SED', "Socio-Economically Disadvantaged Status per School over 7 Years")
Key Findings:
- Gender Distribution: Most schools have a balanced distribution of male and female students, with Castlemont having the highest overall student count. The presence of a small number of “N” (likely representing non-binary or not reported) entries suggests that some schools are starting to account for gender diversity beyond the binary categories.
- Ethnic Diversity: African American and Latino students make up the majority across most schools, but their proportions vary significantly. Castlemont, East Oakland PRIDE, and Elmhurst United have a higher concentration of Latino students, while the rest have more African American students. Smaller ethnic groups, such as Pacific Islanders and Native Americans, are underrepresented, which may affect certain analyses if not accounted for properly.
- Language Fluency: English Only (EO) students dominate in all schools, particularly in MLK, McClymonds, Prescott, and WOMS. English Learners (EL) and Reclassified Fluent English Proficient (RFEP) are also significant, especially in Castlemont, East Oakland PRIDE, and Elmhurst United.
- Special Education (SpEd): A considerable number of students in each school are enrolled in special education programs, with Castlemont and Elmhurst United having the highest counts. However, the majority of students across all schools are classified as Not Special Ed. This distribution provides insights into the varied support needs within each school.
- Socio-Economically Disadvantaged (SED) Status: High counts of SED students are evident in all schools, indicating economic challenges in these communities. A large number of “Unknown” entries, particularly in Castlemont, may obscure true socio-economic trends and should be considered when analyzing the impact of SED on student outcomes.
Heat Map
Numerical Data: Pearson’s Correlation to Visualize Linear Relationship of Each Pair.
def plot_corr(cols):
corr_df = schools[cols]
corr_df.corr()
plt.figure(figsize=(15, 9))
sns.heatmap(corr_df.corr(), cmap='coolwarm', annot=True)
plt.title(f'Correlation Heat Map for Numerical Data', fontweight='bold', color='darkblue');
# Numerical data for correlation map
corr_cols = ['AttRate', 'Age', 'DaysEnr', 'DaysAbs', 'CurrWeightedTotGPA']
plot_corr(corr_cols)
Observation: Pairs that has high correlation
- DaysAb and AttRate: This pair exhibits a high correlation due to their inherent dependency. Days absent (DaysAbs) is inversely related to attendance rate (AttRate), as a higher number of absences naturally lowers the attendance rate. This dependency is expected and reflects how these two variables capture similar information.
- DaysEnr and CurrWeightedTotGPA: There is a moderate positive linear relationship between days enrolled (DaysEnr) and cumulative GPA (CurrWeightedTotGPA), with a correlation of 0.43. While this shows that students enrolled for more days tend to have a higher GPA, this relationship is not directly relevant to our analysis of chronic absenteeism and may not be a primary focus.
- AttRate and CurrWeightedTotGPA: There is a positive linear relationship between attendance rate (AttRate) and GPA, with a correlation of 0.42. This suggests that better attendance is associated with higher academic performance, which aligns with the understanding that consistent attendance is beneficial for student success. This relationship is worth investigating further, as it may provide insights into how attendance impacts academic outcomes and possibly influences chronic absenteeism.
Categorical Correlations: Using Theil’s U to Visualize Predictive Strength between Two Categorical Variables
Theil’s U is used to assess the relationship between categorical variables, offering insights into how one categorical variable can predict another. Theil’s U values range from 0 to 1, where 0 indicates no predictive power and 1 indicates perfect predictability. It’s important to note that Theil’s U is asymmetric, meaning the predictive power of variable A on variable B is not necessarily the same as the predictive power of B on A. This characteristic allows for a more nuanced understanding of relationships between categorical variables, where directionality matters.
For more details on Theil’s U and its applications in evaluating categorical correlation, refer to this article: The Search for Categorical Correlation.
# Theil'U Heat Map for Categorical Variable
from collections import Counter
import math
# Define Theil's U Calculation Function
def conditional_entropy(x, y):
y_counter = Counter(y)
xy_counter = Counter(list(zip(x, y)))
total_occurrences = sum(y_counter.values())
entropy = 0.0
for xy, xy_count in xy_counter.items():
p_xy = xy_count / total_occurrences
p_y = y_counter[xy[1]] / total_occurrences
entropy += p_xy * math.log(p_y / p_xy)
return entropy
def theils_u(x, y):
s_xy = conditional_entropy(x, y)
x_counter = Counter(x)
total_occurrences = sum(x_counter.values())
p_x = [x_count / total_occurrences for x_count in x_counter.values()]
s_x = -sum([p * math.log(p) for p in p_x])
return (s_x - s_xy) / s_x if s_x != 0 else 1
cat_cols = ['Gen', 'Eth', 'Fluency', 'SpEd', 'Grade', 'SED', 'Age', 'Year', 'School', 'ChroAbs']
# Create an empty DataFrame to store Theil's U values
theils_u_matrix = pd.DataFrame(index=cat_cols, columns=cat_cols)
# Fill the matrix with Theil's U values
for col1 in cat_cols:
for col2 in cat_cols:
if col1 == col2:
theils_u_matrix.loc[col1, col2] = 1 # Perfect association with itself
else:
theils_u_matrix.loc[col1, col2] = theils_u(schools[col1], schools[col2])
# Convert to float for heatmap compatibility
theils_u_matrix = theils_u_matrix.astype(float)
# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(theils_u_matrix, annot=True, cmap="YlGnBu", square=True)
plt.title("Theil's U Heatmap of Categorical Variables")
plt.show()
Observation: The factors with the strongest predictive power on Chronic Absenteeism are ranked as follows: Year > School > Age > Grade > Fluency > Ethnicity > SED > SpEd > Gender.
Additionally, the map shows that Grade has a strong predictive relationship with both School and Age (correlations of 0.81 and 0.67, respectively). This is logical, as schools are divided into levels (e.g., Elementary, Middle, High School), each covering specific grade ranges. Similarly, students are typically placed in grades based on their age. This alignment is reflected in the fact that these three variables—Grade, School, and Age—have similar correlation values with Chronic Absenteeism, indicating they offer comparable predictive power for absenteeism. Therefore, when building a predictive model, it may be beneficial to select only one of these variables to avoid redundancy and reduce model complexity.
Another noteworthy observation is the moderate predictive power of SED on Year (0.29), while Year has a stronger predictive power on SED (0.7). Further investigation is needed to understand the underlying reasons for this relationship.
3. Data Cleaning
##3a. Handle Missing Values
schools.isna().sum()| 0 | |
|---|---|
| ANON_ID | 0 |
| Birthdate | 0 |
| Gen | 0 |
| Eth | 0 |
| Fluency | 0 |
| SpEd | 0 |
| Grade | 0 |
| AttRate | 29 |
| DaysEnr | 29 |
| DaysAbs | 29 |
| Susp | 20556 |
| CurrWeightedTotGPA | 6441 |
| SED | 0 |
| School | 0 |
| Year | 0 |
| Age | 0 |
| ChroAbs | 0 |
We need to handle misisng values in columns AttRate,DaysEnr, DaysAbs,Susp, and CurrWeightedTotGPA
Suspension: Fill all missing values with 0, as students with no suspension records from schools are assumed to have none.
schools_cleaned = schools.copy()schools_cleaned['Susp'] = schools_cleaned['Susp'].fillna(0)
schools_cleaned.isna().sum()| 0 | |
|---|---|
| ANON_ID | 0 |
| Birthdate | 0 |
| Gen | 0 |
| Eth | 0 |
| Fluency | 0 |
| SpEd | 0 |
| Grade | 0 |
| AttRate | 29 |
| DaysEnr | 29 |
| DaysAbs | 29 |
| Susp | 0 |
| CurrWeightedTotGPA | 6441 |
| SED | 0 |
| School | 0 |
| Year | 0 |
| Age | 0 |
| ChroAbs | 0 |
AttRate, DaysEnr, and Days Abs: Some students have records for the following year, meaning missing values for these fields are likely from newly enrolled students. Some students appear only once, suggesting they did not stay enrolled the following year. Since the percentage of these students per school is low, drop those rows with missing values in these columns, as they do not provide meaningful information and do not significantly impact the overall dataset.
# Look at 29 students with NaN Values in AttRate, DaysEnr, and DaysAbs and how many times they appear in the dataset
ids = schools_cleaned[schools_cleaned['DaysEnr'].isnull()]['ANON_ID'].unique()
schools_cleaned[schools_cleaned['ANON_ID'].isin(ids)]['ANON_ID'].value_counts()| count | |
|---|---|
| ANON_ID | |
| 46788 | 3 |
| 74073 | 3 |
| 74195 | 3 |
| 45261 | 3 |
| 23826 | 3 |
| 76869 | 3 |
| 76885 | 3 |
| 67766 | 2 |
| 11215 | 2 |
| 17509 | 2 |
| 5645 | 2 |
| 37190 | 2 |
| 54615 | 2 |
| 31347 | 2 |
| 74203 | 2 |
| 54845 | 2 |
| 76930 | 1 |
| 44957 | 1 |
| 49492 | 1 |
| 49493 | 1 |
| 45784 | 1 |
| 58263 | 1 |
| 75819 | 1 |
| 60842 | 1 |
| 60070 | 1 |
| 51972 | 1 |
| 56537 | 1 |
| 73585 | 1 |
| 62638 | 1 |
# Look at one student
schools_cleaned[schools_cleaned['ANON_ID'] == 74195]| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | Age | ChroAbs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2855 | 74195 | 2004-10-26 | M | Latino | RFEP | Not Special Ed | 10 | NaN | NaN | NaN | 0.0 | 0.00 | SED | Castlemont | 19-20 | 15 | 0 |
| 3737 | 74195 | 2004-10-26 | M | Latino | RFEP | Not Special Ed | 11 | 0.2333 | 180.0 | 138.0 | 0.0 | 0.00 | SED | Castlemont | 20-21 | 16 | 1 |
| 4621 | 74195 | 2004-10-26 | M | Latino | RFEP | Not Special Ed | 12 | 0.5056 | 180.0 | 89.0 | 0.0 | 1.75 | SED | Castlemont | 21-22 | 17 | 1 |
# Percentage of missing values with respect to school
schools_cleaned.groupby('School')['AttRate'].apply(lambda x: x.isna().mean())| AttRate | |
|---|---|
| School | |
| Castlemont | 0.002169 |
| East Oakland PRIDE | 0.001519 |
| Elmhurst United | 0.000401 |
| MLK | 0.001103 |
| McClymonds | 0.001949 |
| Prescott | 0.000000 |
| WOMS | 0.000645 |
# Drop the rows with those missing values of AttRate, DaysEnrolled, DaysAbs
schools_cleaned.dropna(subset=['AttRate', 'DaysEnr', 'DaysAbs'], inplace=True)
schools_cleaned.isna().sum()| 0 | |
|---|---|
| ANON_ID | 0 |
| Birthdate | 0 |
| Gen | 0 |
| Eth | 0 |
| Fluency | 0 |
| SpEd | 0 |
| Grade | 0 |
| AttRate | 0 |
| DaysEnr | 0 |
| DaysAbs | 0 |
| Susp | 0 |
| CurrWeightedTotGPA | 6434 |
| SED | 0 |
| School | 0 |
| Year | 0 |
| Age | 0 |
| ChroAbs | 0 |
CurrWeightedTotGPA:
All records for grades -1 through 5 have missing values in the GPA column, with a small percentage of missing values also present for grades 9, 10, and 11. Given the strong correlation between GPA and Absent Rate, as shown in the correlation heatmap in section 2, it’s essential to retain this column in a way that minimizes noise. We cannot assign a numerical GPA to records in grades -1 to 5, as arbitrary values could misrepresent GPA meaningfully.
To address this, I propose creating a new variable, Academic Status, which categorizes GPA into four levels:
Good: GPA ≥ 3.0 Average: 2.0 ≤ GPA < 3.0 At Risk: GPA < 2.0 Unknown: GPA is NaN
After creating Academic Status, we can drop the GPA column and explore the relationship between this new variable and the response variable, ChroAbs.
#Percentage of missing GPA in each grade
schools.groupby('Grade')['CurrWeightedTotGPA'].apply(lambda x: x.isna().mean())| CurrWeightedTotGPA | |
|---|---|
| Grade | |
| -1 | 1.000000 |
| 0 | 1.000000 |
| 1 | 1.000000 |
| 2 | 1.000000 |
| 3 | 1.000000 |
| 4 | 1.000000 |
| 5 | 1.000000 |
| 6 | 0.000000 |
| 7 | 0.000000 |
| 8 | 0.000000 |
| 9 | 0.001144 |
| 10 | 0.001181 |
| 11 | 0.000931 |
| 12 | 0.000000 |
# Define a function to categorize academic status based on the given criteria
def categorize_academic_status(gpa):
if pd.isna(gpa):
return "Unknown"
elif gpa >= 3.0:
return "Good"
elif 2.0 <= gpa < 3.0:
return "Average"
else:
return "AtRisk"
# Apply the function to create a new column 'AcademicStatus'
schools_cleaned['AcademicStatus'] = schools_cleaned['CurrWeightedTotGPA'].apply(categorize_academic_status)
schools_cleaned.drop(columns=['CurrWeightedTotGPA'], inplace=True)
schools_cleaned.head()| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | SED | School | Year | Age | ChroAbs | AcademicStatus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 338 | 2003-07-21 | F | Latino | RFEP | Not Special Ed | 9 | 0.9889 | 180.0 | 2.0 | 0.0 | Unknown | Castlemont | 17-18 | 14 | 0 | Good |
| 1 | 340 | 2003-03-31 | M | Latino | EL | Not Special Ed | 9 | 0.8389 | 180.0 | 29.0 | 2.0 | Unknown | Castlemont | 17-18 | 14 | 1 | AtRisk |
| 2 | 478 | 2003-09-06 | F | Multiple Ethnicity | EO | Special Ed | 9 | 0.7263 | 179.0 | 49.0 | 1.0 | Unknown | Castlemont | 17-18 | 14 | 1 | AtRisk |
| 3 | 686 | 2000-04-02 | M | Latino | EL | Not Special Ed | 12 | 0.9611 | 180.0 | 7.0 | 0.0 | Unknown | Castlemont | 17-18 | 17 | 0 | Average |
| 4 | 693 | 2002-03-28 | F | Latino | RFEP | Not Special Ed | 10 | 0.9889 | 180.0 | 2.0 | 0.0 | SED | Castlemont | 17-18 | 15 | 0 | Average |
schools_cleaned.isna().sum()| 0 | |
|---|---|
| ANON_ID | 0 |
| Birthdate | 0 |
| Gen | 0 |
| Eth | 0 |
| Fluency | 0 |
| SpEd | 0 |
| Grade | 0 |
| AttRate | 0 |
| DaysEnr | 0 |
| DaysAbs | 0 |
| Susp | 0 |
| SED | 0 |
| School | 0 |
| Year | 0 |
| Age | 0 |
| ChroAbs | 0 |
| AcademicStatus | 0 |
Missing Values: Before and After
# Suspension value before and after
plot_distribution_and_summary(schools, 'Susp', "Suspension Value Before Cleaning")
plot_distribution_and_summary(schools_cleaned, 'Susp', "Suspension Value After Cleaning")
# Attendance Rate Before Cleaning
plot_distribution_and_summary(schools, 'AttRate', "Attendance Rate Before Cleaning")
# Attendance Rate After Cleaning
plot_distribution_and_summary(schools_cleaned, 'AttRate', "Attendance Rate After Cleaning")
# Visualize Chronic Absenteeism by Academic Status
plt.figure(figsize=(10, 5))
plt.title('Counts of Chronic Absenteeism per Academic Status' ,fontweight='bold', color='darkblue')
sns.countplot(data=schools_cleaned, x='AcademicStatus', hue='ChroAbs', palette='Paired')
plt.grid(True, linestyle='--', alpha=0.5);
Summarization of Data Cleaning:
Suspension (Susp) had the highest proportion of missing values, with roughly 93% of entries lacking data. I filled these missing values with zeros, assuming that students with no suspension record had no suspensions.
For Current Weighted Total GPA (CurrWeightedTotGPA), All records for grades -1 through 5 have missing values in the GPA column, with a small percentage of missing values also present for grades 9, 10, and 11. Given the strong correlation between GPA and Absent Rate, as shown in the correlation heatmap in Section 2, it’s essential to retain this column in a way that minimizes noise. We cannot assign a numerical GPA to records in grades -1 to 5, as arbitrary values could misrepresent GPA meaningfully. To address this, I propose creating a new variable, Academic Status, which categorizes GPA into four levels: Good: GPA ≥ 3.0, Average: 2.0 ≤ GPA < 3.0, At Risk: GPA < 2.0, and Unknown: GPA is NaN. After creating Academic Status, I drop the GPA column and explore the relationship between this new variable and the response variable, ChroAbs.
I also found missing entries in Attendance Rate (AttRate), Days Enrolled (DaysEnr), and Days Absent (DaysAbs), affecting about 0.1% of the dataset with 29 records each. Some students have records for the following year, meaning missing values for these fields are likely from newly enrolled students. Some students appear only once, suggesting they did not stay enrolled the following year. Since the percentage of these students per school is low, drop those rows with missing values in these columns, as they do not provide meaningful information and do not significantly impact the overall dataset.
3b. Handle Duplicates and Irrelevant Data
# Frequency of values in each categorical column
# Generate HTML for each column's value counts
html_output = "<div style='display: flex; flex-wrap: wrap;'>"
cat_cols = ['Gen', 'Eth', 'Fluency', 'SpEd', 'Grade', 'SED', 'Age', 'ChroAbs']
for col in cat_cols:
# Get the value counts for the column
counts = schools[col].value_counts()
# Create an HTML table for the value counts of each column
html_output += f"""
<div style='margin-right: 20px; padding: 10px; border: 1px solid #ddd;'>
<h4>{col}</h4>
{counts.to_frame().to_html(header=False)}
</div>
"""
html_output += "</div>"
# Display all counts side by side in HTML format
display(HTML(html_output))Gen
| Gen | |
|---|---|
| M | 11857 |
| F | 10113 |
| N | 24 |
Eth
| Eth | |
|---|---|
| Latino | 10645 |
| African American | 8335 |
| White | 840 |
| Asian | 753 |
| Not Reported | 526 |
| Multiple Ethnicity | 424 |
| Pacific Islander | 358 |
| Native American | 63 |
| Filipino | 50 |
Fluency
| Fluency | |
|---|---|
| EO | 10653 |
| EL | 8281 |
| RFEP | 2688 |
| IFEP | 266 |
| TBD | 100 |
| Unknown | 6 |
SpEd
| SpEd | |
|---|---|
| Not Special Ed | 18144 |
| Special Ed | 3850 |
Grade
| Grade | |
|---|---|
| 9 | 2623 |
| 10 | 2540 |
| 7 | 2189 |
| 6 | 2182 |
| 8 | 2168 |
| 11 | 2148 |
| 12 | 1711 |
| 1 | 1089 |
| 3 | 1065 |
| 2 | 1036 |
| 4 | 1015 |
| 0 | 980 |
| 5 | 940 |
| -1 | 308 |
SED
| SED | |
|---|---|
| SED | 14363 |
| Unknown | 6317 |
| Not SED | 1314 |
Age
| Age | |
|---|---|
| 16 | 2267 |
| 13 | 2253 |
| 12 | 2236 |
| 17 | 2176 |
| 15 | 2142 |
| 14 | 2134 |
| 11 | 1785 |
| 7 | 1036 |
| 8 | 1033 |
| 6 | 1020 |
| 9 | 1017 |
| 10 | 981 |
| 5 | 814 |
| 18 | 798 |
| 19 | 161 |
| 4 | 96 |
| 20 | 41 |
| 21 | 4 |
ChroAbs
| ChroAbs | |
|---|---|
| 0 | 11882 |
| 1 | 10112 |
Gender: has one unsual value of ‘m’. Modify it to “M”
# Gender has one unsual value of 'm'. Modify it to "M"
schools['Gen'] = schools['Gen'].replace('m', 'M')
schools['Gen'].value_counts()| count | |
|---|---|
| Gen | |
| M | 11857 |
| F | 10113 |
| N | 24 |
Fluency:
There are 100 records labeled as TBD and 6 asUnknown in the Fluency column. Upon examining records with TBD, we find that these are primarily new enrollees. For students who reappear in subsequent years, their Fluency values are recorded in later years. Based on this insight, I fill Unknown or TBD entries in Fluency with matching values from other records with the same ANON_ID. After this process, 68 TBD and 2 Unknown entries remain, representing students who appear only once in the dataset—likely indicating that they did not re-enroll. Even though these students account for only 3% of the dataset, they still provide valuable information in other variables, such as Attendance Rate, SED, SpEd, etc. Therefore, I’ve decided to retain these records rather than dropping them. However, to simplify our model, I will consolidate TBD and Unknown into a single label, Unknown, in the Fluency column. This will help reduce model complexity while preserving the information these records offer.
# Look at the records that have Unknown in Fluency
schools[schools['Fluency'] == 'Unknown']| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | Age | ChroAbs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4662 | 78297 | 2004-11-24 | M | Latino | Unknown | Not Special Ed | 10 | 0.9892 | 93.0 | 1.0 | NaN | 3.63 | Not SED | Castlemont | 21-22 | 17 | 0 |
| 4663 | 78298 | 2004-11-24 | M | Latino | Unknown | Not Special Ed | 10 | 0.9892 | 93.0 | 1.0 | NaN | 3.63 | Not SED | Castlemont | 21-22 | 17 | 0 |
| 5159 | 44057 | 2008-01-15 | F | Latino | Unknown | Not Special Ed | 9 | 0.5122 | 41.0 | 20.0 | NaN | 0.00 | Unknown | Castlemont | 22-23 | 14 | 1 |
| 12481 | 5911 | 2009-10-17 | F | Latino | Unknown | Not Special Ed | 7 | 0.7385 | 65.0 | 17.0 | NaN | 3.00 | Unknown | Elmhurst United | 22-23 | 13 | 1 |
| 15905 | 50958 | 2014-03-23 | M | White | Unknown | Not Special Ed | 1 | 0.9063 | 64.0 | 6.0 | NaN | NaN | Not SED | MLK | 21-22 | 7 | 0 |
| 20077 | 31810 | 2017-02-25 | M | Not Reported | Unknown | Not Special Ed | -1 | 0.0833 | 12.0 | 11.0 | NaN | NaN | SED | Prescott | 21-22 | 4 | 1 |
# Examine one student
schools[schools['ANON_ID'] == 78297]| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | Age | ChroAbs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4662 | 78297 | 2004-11-24 | M | Latino | Unknown | Not Special Ed | 10 | 0.9892 | 93.0 | 1.0 | NaN | 3.63 | Not SED | Castlemont | 21-22 | 17 | 0 |
| 5571 | 78297 | 2004-11-24 | M | Latino | EL | Not Special Ed | 11 | 0.9056 | 180.0 | 17.0 | NaN | 3.43 | SED | Castlemont | 22-23 | 18 | 0 |
| 6442 | 78297 | 2004-11-24 | M | Latino | EL | Not Special Ed | 12 | 0.8833 | 180.0 | 21.0 | NaN | 2.40 | SED | Castlemont | 23-24 | 19 | 1 |
# Function to fill 'Unknown' or 'TBD' in 'Fluency' with matching 'ANON_ID' values
def fill_fluency(row):
if row['Fluency'] in ['Unknown', 'TBD']:
# Find the first non-null value of 'Fluency' for the same 'ANON_ID'
matching_value = schools_cleaned[(schools_cleaned['ANON_ID'] == row['ANON_ID']) & (schools_cleaned['Fluency'].notna()) & (~schools_cleaned['Fluency'].isin(['Unknown', 'TBD']))]
# If there is a match, return its 'Fluency' value; otherwise, return the original 'Fluency'
if not matching_value.empty:
return matching_value.iloc[0]['Fluency']
return row['Fluency']
# Apply the function to the 'Fluency' column
schools_cleaned['Fluency'] = schools_cleaned.apply(fill_fluency, axis=1)schools_cleaned['Fluency'].value_counts()| count | |
|---|---|
| Fluency | |
| EO | 10638 |
| EL | 8308 |
| RFEP | 2683 |
| IFEP | 266 |
| TBD | 68 |
| Unknown | 2 |
print('Percentage of TBD and Unknown left in the dataset: ', 70/schools_cleaned.shape[0])Percentage of TBD and Unknown left in the dataset: 0.0031868882312770315
# Examine the students left with TBD values in Fluency
schools_cleaned[schools_cleaned['Fluency'] == 'TBD']| ANON_ID | Birthdate | Gen | Eth | Fluency | SpEd | Grade | AttRate | DaysEnr | DaysAbs | Susp | CurrWeightedTotGPA | SED | School | Year | Age | ChroAbs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 870 | 2000-12-23 | M | Latino | TBD | Not Special Ed | 9 | 0.8761 | 113.0 | 14.0 | 0.0 | 1.82 | Unknown | Castlemont | 17-18 | 17 | 1 |
| 52 | 3933 | 2002-12-27 | F | Latino | TBD | Not Special Ed | 9 | 0.9344 | 122.0 | 8.0 | 0.0 | 2.94 | Unknown | Castlemont | 17-18 | 15 | 0 |
| 53 | 3936 | 2002-11-16 | M | Latino | TBD | Not Special Ed | 9 | 0.9426 | 122.0 | 7.0 | 0.0 | 1.76 | Unknown | Castlemont | 17-18 | 15 | 0 |
| 201 | 14335 | 2000-03-04 | M | Latino | TBD | Not Special Ed | 9 | 0.6889 | 45.0 | 14.0 | 0.0 | 0.00 | Unknown | Castlemont | 17-18 | 17 | 1 |
| 251 | 17703 | 2001-01-22 | M | Latino | TBD | Not Special Ed | 9 | 1.0000 | 9.0 | 0.0 | 0.0 | 0.00 | Unknown | Castlemont | 17-18 | 16 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20067 | 26375 | 2016-11-21 | F | Latino | TBD | Not Special Ed | -1 | 0.8788 | 33.0 | 4.0 | 0.0 | NaN | Not SED | Prescott | 21-22 | 5 | 1 |
| 20068 | 26376 | 2016-12-21 | F | Latino | TBD | Not Special Ed | -1 | 0.8788 | 33.0 | 4.0 | 0.0 | NaN | Not SED | Prescott | 21-22 | 5 | 1 |
| 20548 | 35013 | 2006-06-15 | M | Not Reported | TBD | Not Special Ed | 6 | 0.6667 | 6.0 | 2.0 | 0.0 | 0.00 | Unknown | WOMS | 17-18 | 11 | 1 |
| 20549 | 35015 | 2003-11-02 | F | Not Reported | TBD | Not Special Ed | 7 | 0.6667 | 6.0 | 2.0 | 0.0 | 0.00 | Unknown | WOMS | 17-18 | 14 | 1 |
| 20904 | 1962 | 2007-11-02 | M | Asian | TBD | Not Special Ed | 6 | 1.0000 | 5.0 | 0.0 | 0.0 | 0.00 | Not SED | WOMS | 19-20 | 12 | 0 |
68 rows × 17 columns
# Replace all 'TBD' left as 'Unknown' in the dataset
schools_cleaned['Fluency'] = schools_cleaned['Fluency'].replace('TBD', 'Unknown')
schools_cleaned['Fluency'].value_counts()| count | |
|---|---|
| Fluency | |
| EO | 10638 |
| EL | 8308 |
| RFEP | 2683 |
| IFEP | 266 |
| Unknown | 70 |
SED
From the Theil’s U correlation heatmap in section 2, I observed a strong predictive relationship between Year and SED. In this section, I’ll explore the reason behind this relationship. By plotting the distribution of SED values across different years, I found that in the two school years 2017-18 and 2018-19, SED values were consistently labeled as Unknown. This finding aligns with the observed relationship between Year and SED in the heatmap.
Upon further investigation, examining students with Unknown values in SED revealed a pattern: these students often have this column filled with meaningful information in the following school years. Following this pattern, I applied the same process to fill in Unknown values for SED where possible. Although some Unknown values remain, we can retain these records as they still contribute valuable information to other variables.
# Plot SED counts per Year
plt.figure(figsize=(10, 5))
sns.countplot(data=schools, y='Year', hue='SED', palette='Paired')
plt.title('SED Counts per Year', fontweight='bold', color='darkblue')
plt.grid(True, linestyle='--', alpha=0.5)
# Function to fill 'Unknown' 'SED' with matching 'ANON_ID' values
def fill_SED(row):
if row['SED'] == 'Unknown':
# Find the first non-null value of 'SED' for the same 'ANON_ID'
matching_value = schools_cleaned[(schools_cleaned['ANON_ID'] == row['ANON_ID']) & (schools_cleaned['SED'].notna()) & (~schools_cleaned['SED'].isin(['Unknown']))]
# If there is a match, return its 'SED' value; otherwise, return the original 'SED'
if not matching_value.empty:
return matching_value.iloc[0]['SED']
return row['SED']
# Apply the function to the 'SED' column
schools_cleaned['SED'] = schools_cleaned.apply(fill_SED, axis=1)Duplicates and Irrelevant: Before and After
#SED: Before and after
fig, axs = plt.subplots(
nrows=1, ncols=2,
figsize=(24, 10)
)
sns.countplot(data=schools, y='Year', hue='SED', palette='Paired', ax=axs[0])
axs[0].set_title('SED Counts per Year Before Cleaning', fontweight='bold', color='darkblue')
axs[0].grid(True, linestyle='--', alpha=0.5)
sns.countplot(data=schools_cleaned, y='Year', hue='SED', palette='Paired', ax=axs[1])
axs[1].set_title('SED Counts per Year After Cleaning', fontweight='bold', color='darkblue')
axs[1].grid(True, linestyle='--', alpha=0.5)
plt.show()
# Theil's U Heat Map before and After
cat_cols_cleaned = ['Gen', 'Eth', 'Fluency', 'SpEd', 'Grade', 'SED', 'Age', 'Year', 'School','AcademicStatus', 'ChroAbs']
# Create an empty DataFrame to store Theil's U values
theils_u_matrix_cleaned = pd.DataFrame(index=cat_cols_cleaned, columns=cat_cols_cleaned)
# Fill the matrix with Theil's U values
for col1 in cat_cols_cleaned:
for col2 in cat_cols_cleaned:
if col1 == col2:
theils_u_matrix_cleaned.loc[col1, col2] = 1 # Perfect association with itself
else:
theils_u_matrix_cleaned.loc[col1, col2] = theils_u(schools_cleaned[col1], schools_cleaned[col2])
# Convert to float for heatmap compatibility
theils_u_matrix_cleaned = theils_u_matrix_cleaned.astype(float)
fig, axs = plt.subplots(
nrows=1, ncols=2,
figsize=(24, 10)
)
# Plot the heatmap
sns.heatmap(theils_u_matrix_cleaned, annot=True, cmap="YlGnBu", square=True, ax=axs[1])
axs[1].set_title("Theil's U Heatmap of Categorical Variables After Cleaning")
sns.heatmap(theils_u_matrix, annot=True, cmap="YlGnBu", square=True, ax=axs[0])
axs[0].set_title("Theil's U Heatmap of Categorical Variables Before Cleaning")
plt.show()
Observation: After modifying the Fluency and SED columns, I observed that Fluency retains its predictive power on Chronic Absenteeism, while SED shows a significant decrease in predictive strength (dropping from 0.0035 to 0.00035). One possible reason is that replacing Unknown values in SED diluted the relationship between Year and SED—since school years 2017-18 and 2018-19 no longer consist exclusively of Unknown values, SED no longer inherits the predictive power of Year on Chronic Absenteeism (where Year has the strongest predictive power).
Additionally, after transforming GPA into the categorical variable Academic Status, the heatmap now includes this new variable. Academic Status shows a strong predictive power on Chronic Absenteeism, with a correlation of 0.05, second only to Year. This aligns with the Pearson correlation heatmap observation where GPA and Attendance Rate had a strong correlation. Furthermore, Academic Status exhibits moderate predictive power on School, likely because elementary schools (covering grades -1 to 5) only have Unknown values in Academic Status.
4. Preprocessing for Machine Learning
Note on Selecting Variables for Our Machine Learning Model
Exclude Personal Identifiers: Omit ANON_ID and Birthdate as these are personal identifiers that do not contribute to predicting Chronic Absenteeism.
Exclude Redundant Attendance Metrics: Exclude AttRate, DaysEnr, and DaysAbs due to their strong dependency on one another, with Chronic Absenteeism (response variable ChroAbs) already capturing this information effectively.
Encoding for Categorical Variables: Use one-hot encoding for all categorical variables, as they are nominal (non-ordinal). Label encoding would introduce artificial order, which is unnecessary and could mislead the model. Additionally, we’ll select a subset of categorical variables with high predictive power and a manageable number of unique categories to reduce model complexity.
Exclude Weak Predictors: Exclude Gender and SED due to their minimal predictive power on Chronic Absenteeism.
Address Potential Bias: Exclude Ethnicity to avoid potential racial bias. Retain Fluency, as it has a stronger predictive relationship with Chronic Absenteeism and moderate predictive power on Ethnicity.
Simplify by Reducing Redundancy: Exclude Grade and Age as they have numerous unique values. Instead, retain School, which is strongly associated with both and provides similar information with fewer categories.
Final Variable Selection: AcademicStatus, Year, School, Fluency, and Special Ed as they each exhibit strong predictive power for Chronic Absenteeism and will support a more interpretable model.
5. Next Steps
X_col = ['Fluency','SpEd','SED', 'Grade', 'SED', 'Year', 'School', 'AcademicStatus']
y_col = 'ChroAbs'from sklearn.model_selection import train_test_split
X = df_transformed
y = schools_cleaned[y_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)Next Steps:
- Train a classification model using X_train and y_train.
- Evaluate its accuracy, precision, recall, and other metrics on X_test, y_test.
Thank you for reading! This completes the data exploration phase and sets the stage for more detailed modeling and intervention planning.