CMSC320 Final Project -- Women and Minorities in STEM

Written by: Jakob Wachter

Welcome, ye weary traveller! It appears you have decided to learn the art of data science! Join me as we journey through the data science pipeline, which consists of five distinct phases, taking us from data collection all the way to insight and conclusions.

In order to travel through the entirety of the data science pipeline, we motivate ourselves with one simple question: Are women and minorities underrepresented in STEM, and if so, by how much? We will try to find an answer to this question by going through the data science pipeline bit-by-bit, and hopefully our analyses will help us draw a meaningful conclusion.

*Do note that for the majority of this document, basic knowledge of Python is assumed. Reliance on deep understanding of the language is kept to a minimum, but familiarity with it may help clarify certain actions.

Data Collection

The first step of the data science pipeline is data collection. In effect, data collection is the process of compiling large amounts of information regarding our topic of interest, so that we may perform analysis on it and draw insight from it later on.

The source of our data on this topic is the High School Longitudinal Study of 2009 (HSLS:09), which is a survey of over 20,000 high school age students. The students in this survey respond to a variety of questions, including sex, race, familial relationship, current academic ability, and subjective questions relating to that student's experience. This data is provided by the US National Center for Education Statistics (NCES), which is an authoritative source on public educational data for the United States.

Longitudinal refers to the fact that the data is collected over a long period of time--this data begins with the students in ninth grade and follows their educational career well into their tertiary education.

We will begin by loading in all of the libraries that we will need to perform our collection. Our primary tool for this section, and generally throughout our analysis, is the pandas library in Python. Pandas provides a fantastic object known as the DataFrame: a headered table whose columns can each carry a different data type. It is fast, flexible, and easy to use. You can read more about pandas here. pandas relies on numpy for its calculations, so we pre-load numpy as well in case we want to use any of its functions.
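As a minimal sketch, loading the two libraries looks like this:

```python
# Core libraries for the rest of the walkthrough.
import numpy as np   # numerical routines that pandas is built on
import pandas as pd  # DataFrame handling
```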

With that out of the way, let us now import the HSLS:09 study. This data is included on my computer in the file hsls_17_student_pets_sr_v1_0.csv; csv stands for "comma separated values" and is a very common format for raw data collection. If one would like to download this data for themselves, here is how to do it:

  1. Head to this website, agree to the terms of use, and then hit "close" on the popup that appears.
  2. Click on the only link under the header HSLS.
  3. Click "Download" on the right side and select "R".
  4. In the first box that says "R Formatted Data Files", click and scroll down to the option that says "CSV".
  5. Click on the first download link that appears.

Be wary if you would like to download this data for your home computer: the full .csv file contains 888.2MB of information when unzipped! Reading this file can take a while on just about any computer out there.

So we have our data. Note that this data is over 9,000 columns wide, which is a ridiculous amount of information! First, what I would like to do is drop (database terminology for getting rid of) all of the columns that involve information regarding replicate weights.

Replicate weights essentially allow for one to derive standard errors from survey responses. While this is an important way to ensure results are slightly more accurate, it is unnecessary for our purposes and adds many thousands of columns of data, so we will drop it.

(Every column after column 4014 is a replicate weight, which is why that number was chosen.)
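A sketch of this load-and-drop step, assuming the filename from the download instructions; a tiny stand-in frame (with invented column names) replaces the real 888 MB file so the example runs anywhere:

```python
import numpy as np
import pandas as pd

# Loading the real file would look like this (slow, ~888 MB unzipped):
#   students = pd.read_csv("hsls_17_student_pets_sr_v1_0.csv", low_memory=False)
# For illustration, mock a small frame whose trailing columns play the role
# of the replicate weights (column names here are invented):
rng = np.random.default_rng(0)
cols = ["X1SEX", "X1RACE", "W1STUDENT001", "W1STUDENT002", "W1STUDENT003"]
students = pd.DataFrame(rng.integers(1, 9, size=(4, 5)), columns=cols)

# In the real data, every column after position 4014 is a replicate weight,
# so slicing with .iloc keeps only the columns before that cutoff. Here the
# cutoff is 2 instead of 4014:
students = students.iloc[:, :2]
```

On the full dataset the same `.iloc` pattern applies, with the cutoff at the replicate-weight boundary instead of 2.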

With this load and change, we have completed the data collection process. Of course, other interests or studies may require a broader range of collection tools, such as scraping data from web pages or querying a public API.

Of course, our data exists in pandas DataFrame format, but that does not mean that it is entirely useful just yet. This, of course, leads us to...

Data Processing

We now move onward to the task of trying to process the data in a way that makes it more "palatable" to a human reader. What we'd like to do is extract the useful information from these 4000 columns to make sure we specifically highlight the aspects we're trying to look at. This way, analysis of these factors comes more quickly to us in the future.

Our first task in the data processing step is to get rid of all of the suppressed columns in the dataset. The NCES website mentions that the public data available to anyone from the HSLS:09 withholds some of the information given in the survey responses, like social strata, what college a student went to, their school code, and other similar information.

Thankfully, the NCES is extremely thorough in their documentation and has helpfully included a "layout" file that tells us what each column means and what values it can take. Columns where the data is suppressed simply read -5 for all values, which tells us that the data was removed specifically for the public release. We will move forward with getting rid of all of these columns now.
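One way to do this, sketched on a small stand-in frame (the variable names below are illustrative, not the real HSLS codes):

```python
import pandas as pd

# Stand-in data: one column is suppressed (-5 for every row).
students = pd.DataFrame({
    "X1SEX": [1, 2, 1],
    "SCHOOL_CODE": [-5, -5, -5],  # suppressed in the public release
    "MATH_SCORE": [0.4, -1.2, 0.8],
})

# Find every column whose values are all -5, then drop them in one go.
suppressed = [col for col in students.columns if (students[col] == -5).all()]
students = students.drop(columns=suppressed)
```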

So now we've gotten rid of suppressed data, which thankfully helps a lot with thinning out the columns. Unfortunately, there are still over 3,000 left. This is where we must get selective with what we keep and what we don't.

The NCES layout file for the HSLS:09 appears to categorize the variables into distinct series according to the first letter of their variable names.

In order to "thin the herd", so to speak, we can get rid of the P, M, N, A, and C series variables. These are important metrics for academic success, of course, but we're focused more on the specific question that involves how a student's sex and ethnicity affects their scholastic achievement. As a consequence, we are safe to remove these variables.

Let us do that now:
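A sketch of the removal, using a handful of invented column names to stand in for the real series:

```python
import pandas as pd

# Invented column names, one per series, standing in for the real variables.
students = pd.DataFrame(columns=["X1SEX", "S1SUREALG", "P1RELSHP",
                                 "M1SEX", "N1SEX", "A1FULLPART", "C1FBGPA"])

# Drop every column belonging to the P, M, N, A, or C series.
drop_prefixes = ("P", "M", "N", "A", "C")
students = students[[c for c in students.columns
                     if not c.startswith(drop_prefixes)]]
```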

Fantastic! We've removed a good chunk of the columns and are now down to less than half of what we started with. But that still leaves us with almost 2000 columns to play with--so which ones are useful to us?

This is where hand-combing through the rest will take us. We've already performed systematic removal of a lot of variables, so now we can start playing with going through one-by-one and looking for ones that interest us. Specifically, I'd like the variables to answer two very specific questions:

  1. Does sex impact STEM representation? If so, why?
  2. Does ethnicity or race impact STEM representation? If so, why?

For this reason, I decided to keep only explicit data on sex and ethnicity, as well as a small set of survey questions that I thought sounded interesting and that primarily gauge the student's perceptions of STEM and academic success. Unfortunately, we cannot use gender data, as it was suppressed in the public set.

These specific variables are listed in the arrays in the code below. We now filter specifically these columns. Explanations of the different variables will come when they are of importance to us--for now, rest assured that I have spent quite some time in picking out interesting variables for us to use.
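The filtering itself is a one-liner; here is the pattern on a stand-in frame (the variable lists below are illustrative placeholders for the curated arrays described above):

```python
import pandas as pd

# Placeholder variable lists; the real arrays hold the hand-picked HSLS codes.
demographics = ["X1SEX", "X1RACE"]
survey_vars = ["MATH_PERSON", "SCI_PERSON"]

students = pd.DataFrame({
    "X1SEX": [1, 2], "X1RACE": [8, 3],
    "MATH_PERSON": [1, 4], "SCI_PERSON": [2, 3],
    "MATH_SCORE": [0.4, -1.2],  # on neither list, so it is dropped
})

# Selecting with a list of column names keeps exactly those columns, in order.
students = students[demographics + survey_vars]
```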

Now we have a database with a bunch of data and a curated list of important and useful columns, which we can now use to do exploratory analysis.

"Great!", you may be saying at this very moment. But what about the rows? We haven't done any useful analysis on those. All we want is to remove rows where race or sex is not present. According to the HSLS:09 layout, this corresponds to a value in that row of -9; so let's remove all rows with that value.
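Filtering those rows out can be done with a boolean mask; a sketch, again on stand-in data, with -9 as the missing-response code from the layout:

```python
import pandas as pd

students = pd.DataFrame({
    "X1SEX": [1, -9, 2],
    "X1RACE": [8, 3, -9],
})

# Keep only rows where both sex and race are known (-9 marks a missing
# response). The .copy() sidesteps pandas' SettingWithCopyWarning if we
# modify the filtered frame later.
known = (students["X1SEX"] != -9) & (students["X1RACE"] != -9)
students = students[known].copy()
```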

This gets rid of a final 1000 rows that wouldn't be able to help us answer our question, as their sex and race are unknown. pandas gives us a warning, but that's something we can safely ignore moving forward, since we're not going to be using the original database any more. With this, we can finally move forward with performing analysis on our data. And we can also rest assured that it is nice and tidy!

Exploratory Analysis

This is the intermediate phase of the data science pipeline. We are moving away from the data wrangling portion and towards the analysis and testing portion of the pipeline. This is usually the more fun part on the side of the data scientist, as one stops futzing with huge databases and starts playing with data.

Now is as important a time as ever to note that although we are traversing what is known as the data science pipeline, that doesn't necessarily mean the pipeline is linear. A better analogy is to think of working in data science as solving an escape room puzzle. Oftentimes one will need to go back and look at old clues and sections of the "room" before moving forward to the next step. Don't be afraid to go back and edit your database should the need arise!

The point of this section is mainly to get creative and see if you observe any interesting trends. To that end, we'll be performing five different analyses, to see if there's anything interesting worth noting. We will go through them one by one. The point of this isn't to create a research paper out of each one, merely to think of interesting ideas that utilize the data we've been working on and to see if there are any interesting ideas to explore. It's okay if an idea ends up being a dead end or uninteresting!

Analysis 1: High School Graduates

These ideas fall roughly in the order in which I thought to explore them. My immediate thought was to attempt to see how many ninth graders ended up graduating from high school, and to examine how this number changed across race and sex. This way we can see if the United States has any notable discrepancies.

This involves the usage of the following variables:

To perform this analysis, we will generate two tables that tally the type of high school completion by race--one for men, and the other for women. We will also calculate, within each race and sex, the proportion of students reaching each outcome. Let us do this now:
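A sketch of the tallies using pd.crosstab on stand-in data (the codings below--1/2 for sex, small integers for race and completion status--are assumptions for the example, not the exact HSLS values):

```python
import pandas as pd

students = pd.DataFrame({
    "X1SEX": [1, 1, 2, 2, 1, 2],          # 1 = male, 2 = female (assumed coding)
    "X1RACE": [8, 3, 8, 3, 8, 8],
    "HS_COMPLETION": [1, 2, 1, 1, 1, 3],  # completion-type code (assumed)
})

men = students[students["X1SEX"] == 1]
# Raw tallies of completion type per race...
men_counts = pd.crosstab(men["X1RACE"], men["HS_COMPLETION"])
# ...and the same table normalized so each race's row sums to 1.
men_props = pd.crosstab(men["X1RACE"], men["HS_COMPLETION"], normalize="index")
```

Repeating the two calls with `X1SEX == 2` produces the women's tables.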

We note a couple of interesting points here, which could be worth expounding upon:

Do you notice any other interesting trends?

Analysis 2: Enrollment Rate over Time

My next idea was to determine the retention rate of students by sex and ethnicity. To do this, we will use the following variables. Note that this analysis is a bit more subjective, since the metrics do not measure exactly the same thing at every sample point.

Unfortunately, since the survey and its questions were different at each sample point, the retention rate measurement can't be perfectly equivalent across waves. Hence, I decided to do the following:

  1. For the first sample point, all races and sexes are at 100% retention rate, since it is assumed they are in ninth grade automatically.
  2. For the second sample point, use the enrollment stat to determine retention rate.
  3. For the third sample point, subtract the number of dropped students from the total number to get the current enrollment rate.
  4. For the fourth sample point, use the application stat to determine retention rate.
  5. For the fifth sample point, use anyone who had received at least an associate's degree.

I chose these specific points for their flexibility regarding gap years, but also to track which students went on to pursue college. Ideally they will give us a rough idea as to how many students are actively pursuing school at each survey point.
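The five rules above can be sketched as a small groupby computation. The per-wave flag columns here are invented stand-ins for the real survey variables:

```python
import pandas as pd

# Invented per-student flags for waves 2-5 (1 = yes, 0 = no).
students = pd.DataFrame({
    "X1RACE": [8, 8, 3, 3],
    "W2_ENROLLED": [1, 1, 1, 0],
    "W3_DROPPED": [0, 1, 0, 0],
    "W4_APPLIED": [1, 0, 1, 0],
    "W5_DEGREE": [1, 0, 1, 0],
})

by_race = students.groupby("X1RACE")
rates = pd.DataFrame({
    "wave1": 1.0,                               # everyone starts enrolled
    "wave2": by_race["W2_ENROLLED"].mean(),     # enrollment stat
    "wave3": 1 - by_race["W3_DROPPED"].mean(),  # total minus dropouts
    "wave4": by_race["W4_APPLIED"].mean(),      # application stat
    "wave5": by_race["W5_DEGREE"].mean(),       # at least an associate's
})
```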

Note that the fifth survey occurred towards the end of the entire study, so the data from X5HIGHDEG likely understates the proportion of the entire sample that ultimately achieved an associate's degree.

Given the type of analyses we performed earlier, do you notice anything interesting in these findings? Any surprises? Think about what you might want to test or examine further, having looked at these numbers. What stands out to you?

Analysis 3: Involvement with STEM

My third analysis, and the first of my original analysis ideas, is to see what involvement people of a given sex and race have with the STEM discipline. Does it appear that a given race or sex is more attracted to STEM than another? To answer this question, we use the following variables:

What I'd like to do is create two different tables that highlight the following:

  1. Do certain races or sexes have a higher amount of STEM degrees, and by how much?
  2. Do certain races or sexes have a higher GPA in STEM classes than others, and how wide is the disparity if it exists?

To do this, we will look at the number of degrees awarded by race and use those counts to find the proportion of a given race that earns a certain degree. We can compare GPAs by finding the mean GPA of each race/sex group, both for their STEM classes and for every class.

Here we use the mean, which is a measure of central tendency; other measures that could work appropriately include the median. I use the mean because I am making the assumption that GPAs are roughly normally distributed within each race and sex, which means that the probability of having a GPA above or below the mean falls off in roughly equal proportion on either side. Under that assumption, extreme outliers are rare enough that the mean is not overly distorted by them, and it can be used to earmark what an "average" GPA for a given race or sex might look like.
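The comparison boils down to a groupby-and-mean; a sketch on stand-in data (the GPA column names are invented):

```python
import pandas as pd

students = pd.DataFrame({
    "X1RACE": [8, 8, 3, 3],
    "X1SEX": [1, 2, 1, 2],
    "STEM_GPA": [3.2, 3.6, 3.0, 3.8],     # invented column names
    "OVERALL_GPA": [3.4, 3.7, 3.1, 3.9],
})

# Mean STEM and overall GPA for each race/sex group.
gpa_means = students.groupby(["X1RACE", "X1SEX"])[["STEM_GPA", "OVERALL_GPA"]].mean()
```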

Analysis 4: Are they Gifts?

The last two analyses are not quite as in depth. They were simply curiosities of mine that happened to arise while thinking about the previous 3 analyses. For this analysis, I am curious to see whether or not students who are naturally gifted at math (or, on the contrary, are not good at it) believe that the ability to do mathematics is a skill or a trait. To do this, the following variables are used:

Here we are interested in seeing whether or not the response correlates with the self-efficacy of the student. I am also interested in seeing how sex correlates with these responses.

There are many more variables that I was unable to perform analyses on, but that piqued my curiosity. See if you can think of things to do with them! The ones I did not use are numerous and test a variety of aspects of a student's life and beliefs. You can find out what they mean and use them by doing the following:

  1. Perform the same steps 1-4 as were needed to download the raw csv dataset we are working with.
  2. Download the "Codebook and Layout" instead.
  3. The variable names and meanings are located in the file Layout_STUDENT.txt. You can use Ctrl+F on your keyboard to search for certain variables and find out what they mean.

Try to think of some interesting tests that may yield surprising results. Consider the following analysis, if you are hesitant to try to find something new:

Extra Analysis: Is there Bias?

Suppose that you want to see what the students thought about racial and gender bias amongst their professors. This involves the usage of the following variables:

Think about what type of categories (combinations of race/sex) that might produce interesting results for you to look at. What kinds of questions might you be able to ask from your final table? Do you think plotting it might give you something visually interesting?

With our analyses complete, we now move to doing something pretty with our analysis!

Data Visualization

Now that we have successfully performed a variety of analyses on our data, it is in our best interest to get a visual representation of what we've been doing this whole time! The whole point of this is to make pretty pictures and see if any trends appear visually that aren't immediately obvious from the tables and data provided above.

Note that normally this step and the previous one are done in tandem; usually, once one has completed some bit of exploratory analysis that yields a subset of the data or an interesting table, they move straight ahead to plotting it, looking for visually interesting characteristics. And, after all, who doesn't like looking at pretty pictures?

I have decided to split these two sections up specifically to recognize that these are two different ways of tackling the problem of understanding and looking for trends in your data. A good data scientist takes a hybrid approach in this matter.

Analysis 1

For Analysis 1, our culminating result was a table giving the proportion of students from a given race that had one of five high school outcomes: graduation, a GED or high school equivalent, dropping out, continuing education, or unknown.

The best way I could think of representing this data was by using a bar chart, to see how each type of high school outcome stacked against one another.

One can make nice plots of pandas data by using the matplotlib library in Python. We will specifically be using the pyplot package of matplotlib, but the entire library is extensive and extremely useful! You can find more information about it at this link.
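A sketch of such a bar chart with pyplot, on made-up proportions (the outcome labels and numbers are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative outcome proportions for two race groups.
outcomes = pd.DataFrame(
    {"Diploma": [0.85, 0.78], "GED": [0.05, 0.08], "Dropped out": [0.10, 0.14]},
    index=["Race A", "Race B"],
)

# DataFrame.plot.bar draws one group of bars per row.
ax = outcomes.plot.bar(rot=0)
ax.set_xlabel("Race")
ax.set_ylabel("Proportion of students")
ax.set_title("High school outcomes by race")
plt.tight_layout()
plt.savefig("analysis1_bar.png")
```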

Visually, our plot tells us much the same story as our original analysis--however, it makes the job of analyzing that data much easier, and quite interesting! Hopefully now you understand why the two are usually performed in tandem.

Analysis 2

Part of data science is being able to effectively communicate the information that you have synthesized in your analyses. A critical part of that is ensuring that you are using the proper type of graph for the data that you have on hand. For example, I could have done a scatter plot for Analysis 1, but it would be almost meaningless--you would end up leaving the graph with more questions than answers, some of them relating to my well-being and sanity. However, since I chose a bar graph, you likely got a nice visual look at what was going on with the data. Great! That is the goal.

We cannot simply approach every graph we have with a bar chart, however. Since our data for Analysis 2 tracks how retention rates change over time across a large number of students, we can use a stack plot to succinctly communicate this information.
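A pyplot stackplot sketch, using invented enrollment counts in place of the real tallies:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Invented enrollment counts at the five survey waves.
waves = [1, 2, 3, 4, 5]
white = [11000, 10600, 10100, 5800, 2900]
black = [2500, 2400, 2250, 1600, 800]
hispanic = [3600, 3450, 3300, 2200, 1100]

fig, ax = plt.subplots()
# Each series is stacked on top of the previous one, so the band heights
# show each group's share of the total at every wave.
ax.stackplot(waves, white, black, hispanic,
             labels=["White", "Black", "Hispanic"])
ax.set_xlabel("Survey wave")
ax.set_ylabel("Students enrolled")
ax.legend(loc="upper right")
fig.savefig("analysis2_stack.png")
```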

This plot told us something that simply looking at the data did not: what proportion of a race is at a given level compared to another. We note that the populations decline dramatically after high school ends, as many students choose not to attend college.

We also note that white students make up more than half of the initial data, and yet appear roughly equal to (if not fewer than) the number of minority students that apply to colleges. We could have chosen a line graph to represent this information as well, or a pie chart if we wanted to look at a certain point in time.

Analysis 3

Here our data gave us the average GPA and number of students enrolled in college for each race. Let us make two different plots: one that looks at how the difference between STEM and overall GPA varies between students, and another for the variation between men and women in STEM degree attainment.

Now it's your turn! Do you think that the data from Analysis 4, the Extra Analysis, or your own analysis could produce a visually interesting image? Try to visualize the data you've created and see what you can find!

A couple of questions you should be asking yourself along the way:

Hypothesis Testing

With our data visualization and exploratory analysis complete, let's now move forward with performing hypothesis testing.

Remember how previously we had been asking questions about our data and trying to look for visual and numerical trends? In effect, hypothesis testing is our way of quantifying how much those trends subvert our expectations, and if there really is something interesting lurking beneath all of the numbers.

So, how does one perform a hypothesis test? Well, it all starts with a hypothesis. We need to formulate the question we want to answer first before we can answer it. The questions I'd like to ask are the following two, which should give the new data scientist an idea of what kinds of questions are normally asked at this phase.

  1. Do your race and sex have an influence on your graduation rate from high school?
  2. Does your mathematical self-efficacy correlate with your belief in mathematics as a skill?

To answer these questions, we will perform two different types of hypothesis tests: chi-squared tests, which check whether observed frequencies differ significantly across groups, suitable for the first question; and linear regression analysis, which looks for linear trends in data and determines whether or not a significant correlation exists between variables. Let us answer the first question now.

Chi Squared Testing

In order to perform our Chi-Squared tests, we will be using the scipy library in Python. You can read more about its documentation at this link.

Let us now use scipy to test the hypothesis that race influences one's high school graduation rate. Here, we make an assumption, called the null hypothesis, that we would like to see whether or not we can reject. Our null hypothesis in this chi-squared test is that graduation outcomes occur at the same rate for every group--that is, that outcome is independent of race. Think about it: if our education system were ideal, this would be true, and all students would have the same graduation rate regardless of upbringing. So, let's test it now:
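A sketch of the test with scipy.stats.chi2_contingency, using an invented contingency table in place of the real tallies:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: rows are race groups, columns are graduation outcomes.
table = pd.DataFrame(
    {"Graduated": [900, 700], "Did not graduate": [100, 300]},
    index=["Race A", "Race B"],
)

# chi2_contingency compares the observed counts with the counts expected
# under the null hypothesis of independence.
chi2, p_value, dof, expected = chi2_contingency(table)
reject_null = p_value < 0.05  # standard 5% significance level
```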

Here we find a significant association between the race of a student and the probability that they will graduate. Hence, it appears that our chi-squared test can reject the null hypothesis: a student's graduation rate is not independent of their race.

Linear Regression Modeling

Instead of scipy for this test, we will instead be using statsmodels, a robust linear regression library for Python. You can read more about its documentation at this link. Both are great tools for performing statistical analysis on your data!

You may be asking--what is a linear regression? In effect, a linear regression is just like the line of best fit that you may have seen in algebra 1 or statistics, but generalized to support any number of variables.

We can then test, in our specific example, whether or not there is a correlation between the self-efficacy, race, and sex of a student, and whether or not that student believes that mathematics is a skill that can be taught to them. Unfortunately, I do not have enough time to run this analysis now, but one can look at this specific documentation page to get a peek at what kinds of linear regression modelling can be done.
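To give a flavour of what that modelling could look like, here is a sketch with statsmodels' formula API on synthetic stand-in data (the column names and the relationship between them are fabricated purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: belief in math-as-a-skill loosely driven by self-efficacy.
rng = np.random.default_rng(42)
n = 200
efficacy = rng.normal(size=n)
belief = 0.5 * efficacy + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"efficacy": efficacy, "belief": belief})

# Ordinary least squares: belief regressed on efficacy.
model = smf.ols("belief ~ efficacy", data=df).fit()
slope = model.params["efficacy"]
p_value = model.pvalues["efficacy"]
```

A real analysis would add race and sex as categorical terms, e.g. `belief ~ efficacy + C(race) + C(sex)`.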

Machine Learning

Although this tutorial does not get into the very complicated subject of machine learning, ML is still an extremely valuable tool to the data scientist.

Machine Learning, at its core, is the idea that one can create a model that will detect patterns in data and utilize those patterns to make predictions. These models can then be used to look at new data points and evaluate certain things about them. For example, suppose we have finished our hypothesis testing, and want to determine the following question:

Given certain characteristics about a student and their high school education, can we predict whether or not that student will graduate from college?

We could then decide on a specific set of variables that we think make for good training data. Since a good model is motivated with sound reasoning, we need to argue for our individual choices.

As you can hopefully see, the decision space of variables for a machine learning model, especially in an example such as this, has to be carefully curated before moving forward. Understand the framework in which you are using a machine learning model, what it will be used for, and why you are letting the model train on each individual variable.

You can read more about the kinds of questions scientists are asking about machine learning ethics in this article by Nature. Additionally, one can read more about the ways in which machine learning algorithms can and will discriminate if metrics are not carefully regulated.

Once one has decided to move forward with creating a machine learning model, there are a variety of tools available to the user. One of the most popular is scikit-learn, a library designed specifically for machine learning models. You can find documentation for scikit-learn at this website. It is very extensive--for the budding data scientist, I recommend looking at the tutorials.
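As a taste of the library, here is a minimal classification sketch on synthetic data; a real model would train on carefully argued-for HSLS variables, as discussed above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features (think: GPA, a survey score) and a graduation label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Hold out a test set so we can measure how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```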

Insight & Conclusions

We have completed our analysis of the data, and I imagine that you, intrepid reader, have learned a thing or two about data science principles! Unless you are my TA, of course, in which case you almost certainly knew all of this stuff.

Quickly, let's recap what we learned about data science and our data along the way:

So, what have we learned? Well, it appears that race does have a significant association with the graduation rates and STEM achievement of a student; I am willing to bet that, with more hypothesis testing, we could draw the same conclusions about sex, and, if we had the data, gender identity.

So, where do we go from here? We've completed our trip through the pipeline, but there's still more to be done. If you are particularly interested, you can look into ways of doing more analysis on this data, or maybe you would like to compile a larger body of data to inform policy solutions. The world is your oyster at this point. Data science is primarily a tool that can help you, rather than being its own science--it is interdisciplinary by nature!

And hopefully you, future data scientist, have learned much about what data science is, and how to do it! I look forward to seeing your own project soon enough.