Welcome, ye weary traveller! It appears you have decided to learn the art of data science! Join me as we make our way through the data science pipeline. This pipeline consists of five distinct phases:
Data Collection: Oftentimes the first part of a data science project involves finding the data that you need in the first place. Data comes in varying degrees of "cleanliness", so it is often the job of the data scientist to gather data in a way that aids further analysis.
Data Processing: Data comes in varying degrees of cleanliness and can take a long time to sift through. Once data is collected, one often needs to go through a processing phase so that any analysis performed later comes (ideally) free of headaches.
Exploratory Analysis/Data Visualization: Once the data has been collected and processed, it's now up to the data scientist to decide what to do with it. This can involve looking for trends in data, creating interesting graphs, and trying to piece together ideas or testable beliefs (hypotheses).
Hypothesis Testing/Machine Learning: After coming up with a set of hypotheses during the exploratory analysis phase, it becomes pertinent to test them! These tests can tell us whether or not trends exist in the data or produce models for what future data might be.
Insight/Conclusions: Data science simply isn't pragmatically important unless we can make a conclusion about what our results mean. Otherwise, we're sitting in the dark with a bunch of cool tools and no practical purpose for them.
In order to travel through the entirety of the data science pipeline, we motivate ourselves with one simple question: Are women and minorities underrepresented in STEM, and if so, by how much? We will try to find an answer to this question by going through the data science pipeline bit-by-bit, and hopefully our analyses will help us draw a meaningful conclusion.
*Do note that for the majority of this document, basic knowledge of Python is assumed. The need to understand the language is minimized as much as possible, but familiarity may help clarify certain steps.
The first step of the data science pipeline is data collection. In effect, data collection is the process of compiling large amounts of information regarding our topic of interest, so that we may analyze it and draw insight from it later on.
The source of our data on this topic is the High School Longitudinal Study of 2009 (HSLS:09), which is a survey of over 20,000 high school age students. The students in this survey respond to a variety of questions, including sex, race, familial relationship, current academic ability, and subjective questions relating to that student's experience. This data is provided by the US National Center for Education Statistics (NCES), which is an authoritative source on public educational data for the United States.
Longitudinal refers to the fact that the data is collected over a long period of time--this data begins with the students in ninth grade and follows their educational career well into their tertiary education.
We will begin by loading in all of the necessary libraries that we will need to perform our collection. Our primary tool for this section, and generally throughout our analysis, is the pandas library in Python. Pandas provides a fantastic way to utilize objects known as DataFrames, which are headered tables that carry different data types in each column. It is fast and easy to use. You can read more about pandas here. pandas relies on numpy for its calculations, so we pre-load it in case we want to use any numpy functions.
#Import pandas and numpy
import numpy as np
import pandas as pd
With that out of the way, let us now import the HSLS:09 study. This data is stored on my computer in the file hsls_17_student_pets_sr_v1_0.csv; csv stands for "comma separated values" and is a very common format for raw data collection. If you would like to obtain this data yourself, the public-use files are available for download from the NCES website.
Be wary if you would like to download this data for your home computer: the full .csv file contains 888.2MB of information when unzipped! Reading this file can take a while on just about any computer out there.
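If loading the full file strains your machine, pandas can read just a slice of it. Here is a minimal sketch, assuming you only want a quick first look; the column list and row cap below are purely illustrative:

# Read only a handful of columns, and only the first few thousand rows, to keep memory down.
# The usecols list here is an example; swap in whichever variables you care about.
peek = pd.read_csv("./notebooks/hsls_17_student_pets_sr_v1_0.csv",
                   usecols=["STU_ID", "X1SEX", "X1RACE"],
                   nrows=5000)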
db = pd.read_csv("./notebooks/hsls_17_student_pets_sr_v1_0.csv")
db
| | STU_ID | SCH_ID | X1NCESID | X2NCESID | STRAT_ID | PSU | X2UNIV1 | X2UNIV2A | X2UNIV2B | X3UNIV1 | ... | W5W1W2W3W4PSRECORDS191 | W5W1W2W3W4PSRECORDS192 | W5W1W2W3W4PSRECORDS193 | W5W1W2W3W4PSRECORDS194 | W5W1W2W3W4PSRECORDS195 | W5W1W2W3W4PSRECORDS196 | W5W1W2W3W4PSRECORDS197 | W5W1W2W3W4PSRECORDS198 | W5W1W2W3W4PSRECORDS199 | W5W1W2W3W4PSRECORDS200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0.000000 | 2098.087446 | 1824.641398 | 0.000000 | 2431.665487 | 0.0 | 0.000000 | 2457.423209 | 0.0 | 2053.407870 |
| 1 | 10002 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2 | 10003 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 3 | 10004 | -5 | -5 | -5 | -5 | -5 | 10 | 1 | 7 | 1001 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 4 | 10005 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23498 | 35202 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 6 | 1111 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 23499 | 35203 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 23500 | 35204 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 249.221073 | 217.039152 | 0.000000 | 236.368745 | 0.000000 | 0.0 | 386.330427 | 0.000000 | 0.0 | 0.000000 |
| 23501 | 35205 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 740.024218 | 0.000000 | 0.000000 | 0.000000 | 775.191655 | 0.0 | 0.000000 | 1006.429693 | 0.0 | 862.728189 |
| 23502 | 35206 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
23503 rows × 9614 columns
So we have our data. Note that this data is over 9000 columns long, which is a ridiculous amount of information! First, what I would like to do is drop (database terminology for getting rid of) all of the columns that involve information regarding replicate weights.
Replicate weights essentially allow one to derive proper standard errors from survey responses. While this is important for producing accurate error estimates, it is unnecessary for our purposes and adds many thousands of columns of data, so we will drop them.
db.drop(db.columns[4014:], axis='columns', inplace=True)
db.head()
| | STU_ID | SCH_ID | X1NCESID | X2NCESID | STRAT_ID | PSU | X2UNIV1 | X2UNIV2A | X2UNIV2B | X3UNIV1 | ... | X5PFYNETPRICEGRT_IM | X5PFYPELLPACK_IM | X5PFYTOTLOAN_IM | X5PFYTOTLOAN2_IM | X5PFYTOTLOAN3_IM | X5EVRFEDAPP_IM | X5FEDAPP14_IM | X5FEDAPP15_IM | X5FEDAPP16_IM | X5PFYTUITION_IM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 10002 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 2 | 10003 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 3 | 10004 | -5 | -5 | -5 | -5 | -5 | 10 | 1 | 7 | 1001 | ... | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 |
| 4 | 10005 | -5 | -5 | -5 | -5 | -5 | 11 | 1 | 1 | 1111 | ... | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 |
5 rows × 4014 columns
(Every column from position 4014 onward was a replicate weight, which is why that number was chosen.)
With this load and change, we have completed the data collection process. Of course, other projects or data sources may require a broader range of collection tools. Some of these tools include:
- requests, which is a tool for retrieving raw HTML from a website. The process of web scraping can be very difficult and elaborate, but requests simplifies the process of actually extracting that HTML from a website. Its documentation can be found at this website. (urllib can be used in place of requests to retrieve data from websites.)
- BeautifulSoup, which is specifically designed for extracting data from raw HTML. It is often used in tandem with requests; a minimal scraping sketch follows below. More information on BeautifulSoup can be found at this website.
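To give a flavor of how these two fit together, here is a minimal scraping sketch. The URL is a placeholder for illustration only, not a data source used anywhere in this tutorial:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of a page, then pull out every table element on it.
# "https://example.com/data" is a stand-in URL, not a real data source.
response = requests.get("https://example.com/data")
soup = BeautifulSoup(response.text, "html.parser")
tables = soup.find_all("table")
print("Found", len(tables), "tables on the page")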
Of course, our data now exists in pandas DataFrame format, but that does not mean that it is entirely useful just yet. This leads us to our next task: processing the data in a way that makes it more "palatable" to a human reader. What we'd like to do is extract the useful information from these 4000 columns so that we specifically highlight the aspects we're trying to look at. This way, analysis of these factors will come more quickly to us later on.
Our first task in the data processing step is to get rid of all of the suppressed columns in the dataset. The NCES website mentions that the public data available to anyone from the HSLS:09 withholds some of the information given in the survey responses, like social strata, what college a student went to, their school code, and other similar information.
Thankfully, the NCES is extremely thorough in its documentation and has helpfully included a "layout" file that tells us what each column means and what values it can take. Columns where the data is suppressed simply read -5 for all values, which tells us that the data was removed specifically for the public release. We will now get rid of all of these columns.
drop_cols = []
for col in db:
unique = db[col].unique()
if len(unique) == 1 and unique[0] == -5:
drop_cols.append(col)
db.drop(drop_cols, axis='columns', inplace=True)
db.head()
| | STU_ID | X2UNIV1 | X2UNIV2A | X2UNIV2B | X3UNIV1 | X4UNIV1 | W1STUDENT | W1PARENT | W1MATHTCH | W1SCITCH | ... | X5PFYNETPRICEGRT_IM | X5PFYPELLPACK_IM | X5PFYTOTLOAN_IM | X5PFYTOTLOAN2_IM | X5PFYTOTLOAN3_IM | X5EVRFEDAPP_IM | X5FEDAPP14_IM | X5FEDAPP15_IM | X5FEDAPP16_IM | X5PFYTUITION_IM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | 11 | 1 | 1 | 1111 | 11111 | 375.667105 | 470.250141 | 423.238620 | 393.169508 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 10002 | 11 | 1 | 1 | 1111 | 11111 | 189.309446 | 224.455466 | 329.640843 | 207.892322 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 2 | 10003 | 11 | 1 | 1 | 1111 | 11111 | 143.591863 | 185.301339 | 231.718703 | 0.000000 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 3 | 10004 | 10 | 1 | 7 | 1001 | 10011 | 227.937019 | 301.431713 | 261.518593 | 306.102816 | ... | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 |
| 4 | 10005 | 11 | 1 | 1 | 1111 | 11111 | 145.019401 | 190.834136 | 169.946035 | 188.432535 | ... | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 |
5 rows × 3082 columns
So now we've gotten rid of suppressed data, which thankfully helps a lot with thinning out the columns. Unfortunately, there are still over 3000. This is where we must get selective with what we keep and what we don't.
The NCES layout file of the HSLS:09 appears to categorize the variables into distinct categories by the first letter of their variable name:
- X- series variables involve "factual" metrics for the student, like GPA, number of credits in a certain field, parental education, and so on. This is where our data for race and sex is located.
- S- series variables involve survey responses by the student to questionnaires produced by the NCES. This is where data for gender is located.
- P- series variables regard the parents' perceptions of their child and the scholastic system. It appears these are also survey responses.
- M- and N- series variables involve the teachers and professors at the student's school.
- A- series variables involve statistics regarding the school itself.
- C- series variables involve counseling and support services offered by the school.

In order to "thin the herd", so to speak, we can get rid of the P-, M-, N-, A-, and C- series variables. These are important metrics for academic success, of course, but we're focused on the specific question of how a student's sex and ethnicity affect their scholastic achievement. As a consequence, we are safe to remove these variables.
Let us do that now:
drop_cols = []
for col in db:
if col.startswith(("P", "M", "N", "A", "C")):
drop_cols.append(col)
db.drop(drop_cols, axis="columns", inplace=True)
db.head()
| | STU_ID | X2UNIV1 | X2UNIV2A | X2UNIV2B | X3UNIV1 | X4UNIV1 | W1STUDENT | W1PARENT | W1MATHTCH | W1SCITCH | ... | X5PFYNETPRICEGRT_IM | X5PFYPELLPACK_IM | X5PFYTOTLOAN_IM | X5PFYTOTLOAN2_IM | X5PFYTOTLOAN3_IM | X5EVRFEDAPP_IM | X5FEDAPP14_IM | X5FEDAPP15_IM | X5FEDAPP16_IM | X5PFYTUITION_IM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | 11 | 1 | 1 | 1111 | 11111 | 375.667105 | 470.250141 | 423.238620 | 393.169508 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 10002 | 11 | 1 | 1 | 1111 | 11111 | 189.309446 | 224.455466 | 329.640843 | 207.892322 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 2 | 10003 | 11 | 1 | 1 | 1111 | 11111 | 143.591863 | 185.301339 | 231.718703 | 0.000000 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 3 | 10004 | 10 | 1 | 7 | 1001 | 10011 | 227.937019 | 301.431713 | 261.518593 | 306.102816 | ... | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 |
| 4 | 10005 | 11 | 1 | 1 | 1111 | 11111 | 145.019401 | 190.834136 | 169.946035 | 188.432535 | ... | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 | -6 |
5 rows × 1977 columns
Fantastic! We've removed a good chunk of the columns and are now down to less than half of what we started with. But that still leaves us with almost 2000 columns to play with--so which ones are useful to us?
This is where hand-combing through the rest comes in. We've already performed systematic removal of a lot of variables, so now we can go through the remainder one by one and look for ones that interest us. Specifically, I'd like the variables to answer two questions: how do a student's sex and ethnicity relate to their academic success, and how do they relate to that student's involvement in and perception of STEM?
For this reason, I decided to keep only explicit data on sex and ethnicity, as well as a small set of survey questions that I thought sounded interesting and that primarily gauge the student's perceptions of STEM and academic success. Unfortunately, we cannot use gender data, as it was suppressed in the public set.
These specific variables are listed in the arrays in the code below, and we now filter the database down to just these columns. Explanations of the different variables will come when they are of importance to us--for now, rest assured that I have spent quite some time picking out interesting variables for us to use.
cols = ["X1SEX", "X1RACE", "X1TXMTH", "X1PAR1RACE", "X1PAR2RACE", "X1SES", "X1MTHEFF", "X1SCIEFF",
"X2ENROLSTAT", "X2DROPSTAT", "X2MEFFORT", "X2SEFFORT",
"X3DROPOUTTIME", "X3DROPSTAT", "X3HSCOMPSTAT",
"X3TGPAMAT", "X3TGPASCI", "X3TGPACOMPSCI", "X3TGPASTEM", "X3TCREDSTEM",
"X4EVRAPPCLG", "X4ENTMJSTNSF", "X4SIBPSE",
"X5STEMCRED", "X5HIGHDEG", "X5STEM1ATT", "X5STEM1GPA", "X5GPAALL", "X5OWEAMT",
"S1MPERSON1", "S1MPERSON2", "S1SPERSON1", "S1SPERSON2",
"S1SCHWASTE", "S1FAVSUBJ", "S1LEASTSUBJ", "S1EDUEXPECT", "S1ABILITYBA",
"S2ABSENT", "S2EDUASP", "S2SUREDIPL", "S2FOCUS2013", "S2TYPEPS2013", "S2FIRSTCHOICE",
"S2ENGCOMP", "S2MTHCOMP", "S2SCICOMP",
"S3CANTAFFORD", "S3CURWORK",
"S4REPUTATION", "S4COSTATTEND", "S4EDUEXP",
"S4MLEARN", "S4MBORN", "S4SLEARN", "S4SBORN",
"S4MTHMF", "S4SCIMF", "S4CSIMF", "S4ENGMF", "S4MTHRC", "S4SCIRC", "S4CSIRC", "S4ENGRC",
"S4GOODINVEST", "S4STUDOREMP"]
db = db[cols]
db
| | X1SEX | X1RACE | X1TXMTH | X1PAR1RACE | X1PAR2RACE | X1SES | X1MTHEFF | X1SCIEFF | X2ENROLSTAT | X2DROPSTAT | ... | S4MTHMF | S4SCIMF | S4CSIMF | S4ENGMF | S4MTHRC | S4SCIRC | S4CSIRC | S4ENGRC | S4GOODINVEST | S4STUDOREMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8 | 0.8304 | 8 | 8 | 1.5644 | 0.95 | 1.10 | 1 | 0 | ... | 1 | 1 | -7 | -7 | 1 | 1 | -7 | -7 | 2 | 1 |
| 1 | 2 | 8 | -0.2956 | 5 | 8 | -0.3699 | 0.55 | 0.69 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | 2 | 1 |
| 2 | 2 | 3 | 1.2997 | 3 | -7 | 1.2741 | 0.68 | -0.05 | 1 | 0 | ... | 4 | 4 | -7 | -7 | 4 | 4 | -7 | -7 | 1 | 1 |
| 3 | 2 | 8 | -0.1427 | 8 | -7 | 0.5498 | 0.10 | 0.25 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 |
| 4 | 1 | 8 | 1.1405 | 8 | -7 | 0.1495 | 0.10 | 0.25 | 1 | 3 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23498 | 2 | 5 | 0.6572 | -8 | -8 | 0.0205 | -1.75 | 1.83 | 5 | 1 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 23499 | 2 | 5 | -0.4529 | -9 | -7 | -1.2098 | 0.82 | -0.36 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 |
| 23500 | 2 | 8 | -0.0935 | 8 | 8 | -0.0649 | -0.17 | -1.02 | 2 | 0 | ... | 4 | 3 | -7 | -7 | 4 | 4 | -7 | -7 | 1 | -3 |
| 23501 | 1 | 8 | 1.0181 | 8 | -7 | 0.8512 | 0.10 | -0.05 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | 2 | 2 |
| 23502 | 2 | 8 | 1.8673 | 8 | 8 | 1.6397 | -0.58 | -0.46 | 1 | 0 | ... | -7 | 4 | -7 | -7 | -7 | 4 | -7 | -7 | 1 | 1 |
23503 rows × 66 columns
We now have a database with a curated list of important and useful columns, which we can use to do exploratory analysis.
"Great!", you may be saying at this very moment. But what about the rows? We haven't done any useful analysis on those. All we want is to remove rows where race or sex is not present. According to the HSLS:09 layout, this corresponds to a value in that row of -9; so let's remove all rows with that value.
indices = []
for index,row in db.iterrows():
if (row["X1SEX"] == -9) or (row["X1RACE"] == -9):
indices.append(index)
db.drop(indices, inplace=True)
db
/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py:4167: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy return super().drop(
| | X1SEX | X1RACE | X1TXMTH | X1PAR1RACE | X1PAR2RACE | X1SES | X1MTHEFF | X1SCIEFF | X2ENROLSTAT | X2DROPSTAT | ... | S4MTHMF | S4SCIMF | S4CSIMF | S4ENGMF | S4MTHRC | S4SCIRC | S4CSIRC | S4ENGRC | S4GOODINVEST | S4STUDOREMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8 | 0.8304 | 8 | 8 | 1.5644 | 0.95 | 1.10 | 1 | 0 | ... | 1 | 1 | -7 | -7 | 1 | 1 | -7 | -7 | 2 | 1 |
| 1 | 2 | 8 | -0.2956 | 5 | 8 | -0.3699 | 0.55 | 0.69 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | 2 | 1 |
| 2 | 2 | 3 | 1.2997 | 3 | -7 | 1.2741 | 0.68 | -0.05 | 1 | 0 | ... | 4 | 4 | -7 | -7 | 4 | 4 | -7 | -7 | 1 | 1 |
| 3 | 2 | 8 | -0.1427 | 8 | -7 | 0.5498 | 0.10 | 0.25 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 |
| 4 | 1 | 8 | 1.1405 | 8 | -7 | 0.1495 | 0.10 | 0.25 | 1 | 3 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23498 | 2 | 5 | 0.6572 | -8 | -8 | 0.0205 | -1.75 | 1.83 | 5 | 1 | ... | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 |
| 23499 | 2 | 5 | -0.4529 | -9 | -7 | -1.2098 | 0.82 | -0.36 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 |
| 23500 | 2 | 8 | -0.0935 | 8 | 8 | -0.0649 | -0.17 | -1.02 | 2 | 0 | ... | 4 | 3 | -7 | -7 | 4 | 4 | -7 | -7 | 1 | -3 |
| 23501 | 1 | 8 | 1.0181 | 8 | -7 | 0.8512 | 0.10 | -0.05 | 1 | 0 | ... | -7 | -7 | -7 | -7 | -7 | -7 | -7 | -7 | 2 | 2 |
| 23502 | 2 | 8 | 1.8673 | 8 | 8 | 1.6397 | -0.58 | -0.46 | 1 | 0 | ... | -7 | 4 | -7 | -7 | -7 | 4 | -7 | -7 | 1 | 1 |
22496 rows × 66 columns
This gets rid of a final 1000 or so rows that wouldn't be able to help us answer our question, as their sex or race is unknown. pandas gives us a warning, but that's something we can ignore moving forward, since we're not going to be using the original database any more (a sketch of how to avoid the warning altogether follows below). With this, we can finally move forward with performing analysis on our data. And we can also rest assured that it is nice and tidy!
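As an aside, if you would rather not see the warning at all, the same column selection and row filtering can be written so that pandas never complains. A small sketch of the equivalent steps (not needed for anything that follows):

# Take an explicit copy when selecting the curated columns, so db becomes its own
# DataFrame rather than a view of the original, then filter rows with a boolean mask.
db = db[cols].copy()
db = db[(db["X1SEX"] != -9) & (db["X1RACE"] != -9)]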
This is the intermediate phase of the data science pipeline. We are moving away from the data wrangling portion and towards the analysis and testing portion of the pipeline. This is usually the more fun part on the side of the data scientist, as one stops futzing with huge databases and starts playing with data.
Now is as important a time as ever to note that although we are traversing what is known as the data science pipeline, that doesn't necessarily mean the pipeline is linear. A better analogy is to think of working in data science as solving an escape room puzzle. Oftentimes one will need to go back and look at old clues and sections of the "room" before moving forward to the next step. Don't be afraid to go back and edit your database should the need arise!
The point of this section is mainly to get creative and see if you observe any interesting trends. To that end, we'll be performing five different analyses, to see if there's anything interesting worth noting. We will go through them one by one. The point of this isn't to create a research paper out of each one, merely to think of interesting ideas that utilize the data we've been working on and to see if there are any interesting ideas to explore. It's okay if an idea ends up being a dead end or uninteresting!
These ideas fall roughly in the order in which I thought to explore them. My immediate thought was to attempt to see how many ninth graders ended up graduating from high school, and to examine how this number changed across race and sex. This way we can see if the United States has any notable discrepancies.
This involves the usage of the following variables:
- X1SEX: The sex of the student. 1 is male and 2 is female.
- X1RACE: The race of the student. Eight different races are recognized, with associated numbers 1-8.
- X3HSCOMPSTAT: The high school completion status of the student. Diploma, GED, dropout, or still enrolled.

To perform this analysis, we will generate two tables that tally the type of high school completion by race: one for men, and the other for women. We will also calculate the proportion of students of each race and sex to see how well they do within their race. Let us do this now:
# a table of races, we will use this later in our data visualization
race_table = ["American Indian/Alaska Native", "Asian", "African-American", "Non-racial Hispanic",
"Racial Hispanic", "Multi-racial", "Native Hawaiian/Pacific Islander", "White"]
grad_table = ["High School Diploma", "GED/HS equiv.", "Dropped Out", "Still Enrolled", "Unknown"]
# get the number of high school outcomes by type for each race and sex
grad_count = np.zeros((2,8,5), dtype=int)
for i,row in db.iterrows():
grad_count[int(row["X1SEX"]-1)][int(row["X1RACE"]-1)][int(row["X3HSCOMPSTAT"]-1)] += 1
# get the combined grad count of both sexes
combined_grad_count = [grad_count[0][i][j] + grad_count[1][i][j] for i in range(8) for j in range(5)]
combined_grad_count = np.reshape(combined_grad_count, (8,5))
# get the total student count for each race
total_race = [np.sum(combined_grad_count[i]) for i in range(8)]
# get the proportion of students of each high school outcome for each race
proportion_grad = [combined_grad_count[i][j] / total_race[i] for i in range(8) for j in range(5)]
proportion_grad = np.reshape(proportion_grad, (8,5))
# return that proportion
df_prop_grad = pd.DataFrame(data=proportion_grad, index=race_table, columns=grad_table)
df_prop_grad
| | High School Diploma | GED/HS equiv. | Dropped Out | Still Enrolled | Unknown |
|---|---|---|---|---|---|
| American Indian/Alaska Native | 0.642424 | 0.042424 | 0.109091 | 0.066667 | 0.139394 |
| Asian | 0.878074 | 0.011270 | 0.010246 | 0.017930 | 0.082480 |
| African-American | 0.759494 | 0.042466 | 0.040016 | 0.049816 | 0.108207 |
| Non-racial Hispanic | 0.661137 | 0.026066 | 0.071090 | 0.063981 | 0.177725 |
| Racial Hispanic | 0.770963 | 0.027259 | 0.036148 | 0.044444 | 0.121185 |
| Multi-racial | 0.805255 | 0.027821 | 0.034518 | 0.039155 | 0.093251 |
| Native Hawaiian/Pacific Islander | 0.718182 | 0.009091 | 0.036364 | 0.072727 | 0.163636 |
| White | 0.848866 | 0.025741 | 0.029631 | 0.022099 | 0.073663 |
We note a couple of interesting points here, which could be worth expounding upon: Asian and White students earn diplomas at the highest rates (roughly 88% and 85%), while American Indian/Alaska Native and non-racial Hispanic students have the lowest diploma rates and the highest dropout rates in the table.
Do you notice any other interesting trends?
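If you want to dig a little further, sorting the proportion table makes the ranking explicit; a quick sketch:

# Rank races by the proportion of students earning a high school diploma
df_prop_grad.sort_values("High School Diploma", ascending=False)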
My next idea was to determine the retention rate of students by race, using the following variables. Note that this analysis is a bit more subjective, since the available metrics do not measure exactly the same thing at each survey wave.
- X1RACE: As before.
- X2ENROLSTAT: Whether a student was still enrolled by the second semester of 11th grade.
- X3DROPSTAT: Whether a student had dropped out before receiving their diploma.
- X4EVRAPPCLG: Whether a student had ever applied to college or was currently in college by the traditional sophomore-year point.
- X5HIGHDEG: The highest degree received by the end of the study, concluded when most students would have received their bachelor's degrees.

As you can see, these don't measure the same metric every time. Unfortunately, since the survey and its questions were different at each sample point, the retention-rate measurement can't be perfect or equivalent across waves. Hence, I decided to do the following: for each race, tally the number of ninth graders, the number still enrolled at the 11th-grade survey, the number who did not drop out of high school, the number who applied to college, and the number who earned at least an associate's degree.
I chose these specific points for their flexibility regarding gap years, but also to track which students went on to pursue college. Ideally they will give us a rough idea as to how many students are actively pursuing school at each survey point.
enroll_table = ["9th grade", "1st sem. 11th grade", "High School", "College app", "Associates+"]
enrollment_rate = np.zeros((8,5), dtype=int)
for i,row in db.iterrows():
enrollment_rate[int(row["X1RACE"]-1)][0] += 1
enrollment_rate[int(row["X1RACE"]-1)][1] += (1 if row["X2ENROLSTAT"] <= 4 else 0)
enrollment_rate[int(row["X1RACE"]-1)][2] += (0 if row["X3DROPSTAT"] == 1 else 1)
enrollment_rate[int(row["X1RACE"]-1)][3] += (1 if row["X4EVRAPPCLG"] == 1 else 0)
enrollment_rate[int(row["X1RACE"]-1)][4] += (1 if row["X5HIGHDEG"] >= 2 else 0)
df_enrollment = pd.DataFrame(data=enrollment_rate, index=race_table, columns=enroll_table)
df_enrollment
| | 9th grade | 1st sem. 11th grade | High School | College app | Associates+ |
|---|---|---|---|---|---|
| American Indian/Alaska Native | 165 | 149 | 149 | 84 | 2 |
| Asian | 1952 | 1850 | 1938 | 1420 | 80 |
| African-American | 2449 | 2268 | 2364 | 1494 | 78 |
| Non-racial Hispanic | 422 | 355 | 396 | 206 | 10 |
| Racial Hispanic | 3375 | 3135 | 3257 | 1942 | 129 |
| Multi-racial | 1941 | 1826 | 1878 | 1205 | 55 |
| Native Hawaiian/Pacific Islander | 110 | 105 | 106 | 56 | 3 |
| White | 12082 | 11493 | 11723 | 7683 | 603 |
Note that the fifth survey occurred towards the end of the entire study, so the data from X5HIGHDEG likely underrepresents the proportion of the entire sample that will ultimately earn an associate's degree or higher.
Given the type of analyses we performed earlier, do you notice anything interesting in these findings? Any surprises? Think about what you might want to test or examine further, having looked at these numbers. What stands out to you?
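One simple way to make these counts easier to compare across races is to divide each row by its ninth-grade total, turning raw counts into retention proportions; a sketch using the table we just built:

# Each cell becomes the fraction of that race's ninth graders remaining at that stage
df_enrollment.div(df_enrollment["9th grade"], axis=0)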
My third analysis, and the first of my original analysis ideas, is to see what involvement people of a given sex and race have with the STEM disciplines. Does it appear that a given race or sex is more attracted to STEM than another? To answer this question, we use the following variables:
- X1SEX, X1RACE: As before.
- X5STEMCRED: Whether or not a student received a degree in a STEM field.
- X5STEM1GPA: The GPA of a student over all of the STEM classes they took in college.
- X5GPAALL: The GPA of a student over all of their classes in college.

What I'd like to do is create two different tables: one giving the mean STEM GPA by race and sex, and another giving the mean overall GPA by race and sex.
To do this, we will count the STEM degrees earned within each race and use those counts to find the proportion of a given race that earns such a degree. We can compare GPAs by finding the mean GPA for each race and sex across their STEM classes and across all of their classes.
Here we use the mean, which is a measure of central tendency; other measures, such as the median, could also work. I use the mean because I am assuming that GPAs are roughly normally distributed within each race and sex, meaning the probability of having a GPA above or below the mean falls off in roughly equal proportion. Under that assumption, the mean is not unduly distorted by extreme values, and it can be used to earmark what an "average" GPA for a given race or sex might look like.
#preparing tables and filtering the database for our searches
sex_table = ["Male", "Female"]
filt_db = db[(db["X5STEMCRED"] >= 0) & (db["X5STEM1GPA"] >= 0)]
stem_gpa = np.zeros((8,2))
all_gpa = np.zeros((8,2))
total_race_college = np.zeros((8,2))
stem_degree_college = np.zeros((8,2))
#sum up the GPA of all of the students by their race and sex
for i, row in filt_db.iterrows():
stem_gpa[int(row["X1RACE"]-1)][int(row["X1SEX"]-1)] += row["X5STEM1GPA"]
all_gpa[int(row["X1RACE"]-1)][int(row["X1SEX"]-1)] += row["X5GPAALL"]
total_race_college[int(row["X1RACE"]-1)][int(row["X1SEX"]-1)] += 1
#tally which students of a given race have a STEM degree
for i, row in filt_db.iterrows():
stem_degree_college[int(row["X1RACE"]-1)][int(row["X1SEX"]-1)] += (1 if row["X5STEMCRED"] == 1 else 0)
#calculate the mean by dividing by the total number of students in the category
for i in range(8):
for j in range(2):
stem_gpa[i][j] /= total_race_college[i][j]
all_gpa[i][j] /= total_race_college[i][j]
#create dataframes from the arrays
df_stem_gpa = pd.DataFrame(data=stem_gpa, index=race_table, columns=sex_table)
df_all_gpa = pd.DataFrame(data=all_gpa, index=race_table, columns=sex_table)
df_stem_gpa
| | Male | Female |
|---|---|---|
| American Indian/Alaska Native | 2.369231 | 2.283333 |
| Asian | 2.789256 | 2.930233 |
| African-American | 1.826909 | 2.086911 |
| Non-racial Hispanic | 2.039130 | 2.490244 |
| Racial Hispanic | 2.243557 | 2.394882 |
| Multi-racial | 2.451812 | 2.511976 |
| Native Hawaiian/Pacific Islander | 2.653846 | 2.253846 |
| White | 2.577737 | 2.762724 |
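As an aside, the same per-group means can be computed more idiomatically with pandas' groupby; a sketch using the filtered frame from above (the resulting index shows the numeric race and sex codes):

# Mean STEM GPA and overall GPA for every (race, sex) pair, in one call
filt_db.groupby(["X1RACE", "X1SEX"])[["X5STEM1GPA", "X5GPAALL"]].mean()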
The last two analyses are not quite as in depth. They were simply curiosities of mine that happened to arise while thinking about the previous 3 analyses. For this analysis, I am curious to see whether or not students who are naturally gifted at math (or, on the contrary, are not good at it) believe that the ability to do mathematics is a skill or a trait. To do this, the following variables are used:
- X1SEX: As before.
- X1MTHEFF: The "self-efficacy" of the student in mathematics, that is, their self-perceived ability to do mathematics problems.
- X1SCIEFF: As above, but for science.
- S4MLEARN, S4MBORN: Whether a student believes that math can be learned, and whether a student believes you can be born good at mathematics, respectively. Rated from 1, strongly agree, to 4, strongly disagree.
- S4SLEARN, S4SBORN: As above, but for science instead of mathematics. Rated similarly.

Here we are interested in seeing whether the responses correlate with the self-efficacy of the student. I am also interested in seeing how sex correlates with these variables.
#creating tables and filtering undesirable results from the database
skill_table = ["Math is a Skill", "Math is a Trait", "Science is a Skill", "Science is a Trait"]
agree_table = ["Strongly Agree", "Agree", "Disagree", "Strongly Disagree"]
# here we filter any of the codes that are negative, because they correspond to data points that
# either were not answered or by students who did not go on to complete the fourth survey
filt_db = db[["X1SEX", "X1MTHEFF", "X1SCIEFF", "S4MLEARN", "S4MBORN", "S4SLEARN", "S4SBORN"]]
filt_db = filt_db[filt_db["S4MLEARN"] > 0]
filt_db = filt_db[filt_db["S4MBORN"] > 0]
filt_db = filt_db[filt_db["S4SLEARN"] > 0]
filt_db = filt_db[filt_db["S4SBORN"] > 0]
#tallying the counts for each survey element by student
stem_skill = np.zeros((2,4,4), dtype=int)
for i, row in filt_db.iterrows():
stem_skill[int(row["X1SEX"]-1)][0][int(row["S4MLEARN"]-1)] += 1
stem_skill[int(row["X1SEX"]-1)][1][int(row["S4MBORN"]-1)] += 1
stem_skill[int(row["X1SEX"]-1)][2][int(row["S4SLEARN"]-1)] += 1
stem_skill[int(row["X1SEX"]-1)][3][int(row["S4SBORN"]-1)] += 1
# combining the two sexes into one table, which is what we will display at the end
combined_stem_skill = np.zeros((4,4), dtype=int)
for i in range(4):
for j in range(4):
combined_stem_skill[i][j] = stem_skill[0][i][j] + stem_skill[1][i][j]
c_skill_df = pd.DataFrame(data=combined_stem_skill, index=skill_table, columns=agree_table)
c_skill_df
| | Strongly Agree | Agree | Disagree | Strongly Disagree |
|---|---|---|---|---|
| Math is a Skill | 3045 | 9514 | 2207 | 301 |
| Math is a Trait | 774 | 3233 | 8451 | 2609 |
| Science is a Skill | 2863 | 10300 | 1756 | 148 |
| Science is a Trait | 576 | 2834 | 9010 | 2647 |
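Since stem_skill was tallied separately for each sex before being combined, a per-sex breakdown is one line away. For example, a sketch of the same table for female respondents (X1SEX == 2):

# stem_skill[0] holds the male tallies and stem_skill[1] the female tallies
pd.DataFrame(data=stem_skill[1], index=skill_table, columns=agree_table)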
There are many more variables that I was unable to perform analyses on, but that were of curiosity to me. See if you can think of things to do with them! The ones I did not use are numerous and test a variety of aspects of a student's life and beliefs. You can find out what they mean and use them by doing the following:
Consult the layout file, Layout_STUDENT.txt, that accompanies the data. You can use Ctrl+F on your keyboard to search for certain variables and find out what they mean.

Try to think of some interesting tests that may yield surprising results. Consider the following analysis if you are hesitant to try to find something new:
Suppose that you want to see what the students thought about racial and gender bias amongst their professors. This involves the usage of the following variables:
- X1SEX, X1RACE: As before.
- S4MTHMF, S4SCIMF, S4CSIMF, S4ENGMF: Whether or not students felt that instructors treated men and women differently in math, science, computer science, and engineering, respectively. Rated from 1, strongly agree, to 4, strongly disagree.
- S4MTHRC, S4SCIRC, S4CSIRC, S4ENGRC: As above, but for racial bias rather than gender bias. Rated similarly.

Think about what combinations of race and sex might produce interesting results for you to look at. What kinds of questions might you be able to ask from your final table? Do you think plotting it might give you something visually interesting? A minimal starting point is sketched below.
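Here is one possible starting sketch, tabulating perceived gender bias in math instruction against the respondent's sex (assuming, as before, that negative codes mark missing responses):

# Cross-tabulate sex against agreement that math instructors treat men and women differently
bias_db = db[db["S4MTHMF"] > 0]
pd.crosstab(bias_db["X1SEX"], bias_db["S4MTHMF"])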
With our analyses complete, we now move to doing something pretty with our analysis!
Now that we have successfully performed a variety of analyses on our data, it is in our best interest to get a visual representation of what we've been doing this whole time! The whole point of this is to make pretty pictures and see if any trends appear visually that aren't immediately obvious from the tables and data provided above.
Note that normally this step and the previous one are done in tandem; once one has completed some bit of exploratory analysis that yields a subset of the data or an interesting table, one usually moves straight to plotting it and looking for visually interesting characteristics. And, after all, who doesn't like looking at pretty pictures?
I have decided to split these two sections up specifically to recognize that these are two different ways of tackling the problem of understanding and looking for trends in your data. A good data scientist takes a hybrid approach in this matter.
For Analysis 1, our culminating result was a table giving the proportion of students from a given race that had one of five high school outcomes--graduating with a diploma, earning a GED or high school equivalent, dropping out, still being enrolled, or unknown.
The best way I could think of to represent this data was a bar chart, so that we can see how each type of high school outcome stacks up across races.
One can make nice plots of pandas data by using the matplotlib library in Python. We will specifically be using the pyplot package of matplotlib, but the entire library is extensive and extremely useful! You can find more information about it at this link.
from matplotlib import pyplot as plt
w = 0.15
fig, axis = plt.subplots(figsize=(20,8))
b1 = axis.bar(np.arange(8)-2*w, df_prop_grad["High School Diploma"], w, color = 'green')
b2 = axis.bar(np.arange(8)-w, df_prop_grad["GED/HS equiv."], w, color = 'mediumaquamarine')
b3 = axis.bar(np.arange(8), df_prop_grad["Still Enrolled"], w, color = 'gold')
b4 = axis.bar(np.arange(8)+w, df_prop_grad["Dropped Out"], w, color = 'firebrick')
b5 = axis.bar(np.arange(8)+2*w, df_prop_grad["Unknown"], w, color = 'gray')
axis.set_ylabel("Proportion of students")
axis.set_xticks(np.arange(8))
axis.set_xticklabels(race_table)
# The bars were plotted in the order diploma, GED, still enrolled, dropped out, unknown,
# so the legend labels must follow that same order (grad_table lists them differently).
l = axis.legend(["High School Diploma", "GED/HS equiv.", "Still Enrolled", "Dropped Out", "Unknown"])
Visually, our plot tells us much the same story as our original analysis--however, it makes the job of analyzing that data much easier, and quite interesting! Hopefully now you understand why the two are usually performed in tandem.
Part of data science is being able to effectively communicate the information that you have synthesized in your analyses. A critical part of that is ensuring that you are using the proper type of graph for the data that you have on hand. For example, I could have done a scatter plot for Analysis 1, but it would be almost meaningless--you would end up leaving the graph with more questions than answers, some of them relating to my well-being and sanity. However, since I chose a bar graph, you likely got a nice visual look at what was going on with the data. Great! That is the goal.
We cannot simply approach every data set with a bar chart, however. Since our data for Analysis 2 tracks how retention changes over time and involves large counts, we can use a stack plot to succinctly communicate this information.
years = [0, 2, 4, 6, 8]
fig, axis = plt.subplots(figsize=(20,8))
axis.stackplot(years, df_enrollment)
axis.set_ylabel("Number of students at level")
axis.set_xlabel("Educational Attainment")
axis.set_xticks(years)
axis.set_xticklabels(enroll_table)
l = axis.legend(race_table)
This plot told us something that simply looking at the data did not: what proportion of a race is at a given level compared to another. We note that the populations decline dramatically after high school ends, as many students choose not to attend college.
We also note that white students make up more than half of the initial data, and yet appear roughly equal to (if not fewer than) the minority students who apply to college. We could also have chosen a line graph to represent this information, or a pie chart if we wanted to look at a certain point in time (see the sketch below).
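For instance, here is a quick sketch of that pie-chart idea, showing the racial composition of the students who had applied to college by the fourth survey:

# One snapshot in time: racial breakdown of college applicants from Analysis 2
fig, axis = plt.subplots(figsize=(8, 8))
axis.pie(df_enrollment["College app"], labels=race_table, autopct="%1.1f%%")
t = axis.set_title("Racial composition of students who applied to college")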
Here our data gave us the average GPA and number of students enrolled in college for each race. Let us make two different plots: one that looks at how the difference between STEM and overall GPA varies between students, and another for the variation between men and women in STEM degree attainment.
diff_gpa = np.subtract(all_gpa, stem_gpa)
from matplotlib import pyplot as plt
w = .4
fig, axes = plt.subplots(2, figsize=(20,16))
b1 = axes[0].bar(np.arange(8)-w/2, diff_gpa[:,0], w, color = 'steelblue')
b2 = axes[0].bar(np.arange(8)+w/2, diff_gpa[:,1], w, color = 'plum')
b3 = axes[1].bar(np.arange(8)-w/2, stem_degree_college[:,0], w, color = 'steelblue')
b4 = axes[1].bar(np.arange(8)+w/2, stem_degree_college[:,1], w, color = 'plum')
axes[0].set_ylabel("GPA Difference")
axes[0].set_title("GPA Difference between all classes and STEM")
axes[0].set_xticks(np.arange(8))
axes[0].set_xticklabels(race_table)
l = axes[0].legend(sex_table)
axes[1].set_ylabel("Number of STEM degrees")
axes[1].set_title("Number of STEM degrees by race and sex")
axes[1].set_xticks(np.arange(8))
axes[1].set_xticklabels(race_table)
l2 = axes[1].legend(sex_table)
Now it's your turn! Do you think that the data from Analysis 4, the Extra Analysis, or your own analysis could produce a visually interesting image? Try to visualize the data you've created and see what stands out.
A couple of questions you should be asking yourself along the way: does the type of plot fit the data you have, and does the picture actually make the trend clearer than the table did?
With our data visualization and exploratory analysis complete, let's now move forward with performing hypothesis testing.
Remember how previously we had been asking questions about our data and trying to look for visual and numerical trends? In effect, hypothesis testing is our way of quantifying how much those trends subvert our expectations, and if there really is something interesting lurking beneath all of the numbers.
So, how does one perform a hypothesis test? Well, it all starts with a hypothesis. We need to formulate the question we want to answer before we can answer it. The questions I'd like to ask are the following two, which should give the new data scientist an idea of what kinds of questions are normally asked at this phase: first, does a student's race have a significant effect on their high school graduation rate? And second, do a student's self-efficacy, race, and sex correlate with whether they believe that mathematics is a skill that can be learned?
To answer these questions, we will perform two different types of hypothesis tests: chi-squared tests, which check whether observed frequencies differ from expected ones across groups and are suitable for the first question; and linear regression analysis, which looks for linear trends in data and determines whether a significant relationship exists between variables. Let us answer the first question now.
In order to perform our Chi-Squared tests, we will be using the scipy library in Python. You can read more about its documentation at this link.
Let us now use scipy to test the hypothesis that race influences one's high school graduation rate. Here, we make an assumption, called the null hypothesis, that we would like to see whether or not we can reject. Our null hypothesis in the case of chi-squared testing is that all of the data should occur at the same rate. Think about it--if our education system was ideal, this would be true, and all students would have the same graduation rate regardless of upbringing. So, let's test it now:
from scipy.stats import chisquare
chisquare(np.transpose(proportion_grad * 100.0))
Power_divergenceResult(statistic=array([125.14233242, 289.18495742, 197.21291897, 139.2578783 ,
206.54740192, 230.31500358, 174.62809917, 264.01442348]), pvalue=array([4.25547426e-26, 2.33036158e-61, 1.49295346e-41, 4.06928988e-29,
1.46884624e-43, 1.12922292e-48, 1.06174414e-36, 6.22116269e-56]))
Here we find that there is a significant relationship between the race of a student and the probability that they will graduate. Hence, it appears that our chi-squared test can reject the null hypothesis, and the race of a student is strongly associated with their graduation rate.
Instead of scipy for this test, we will instead be using statsmodels, a robust linear regression library for Python. You can read more about its documentation at this link. Both are great tools for performing statistical analysis on your data!
You may be asking--what is a linear regression? In effect, a linear regression is just like the line of best fit that you may have seen in algebra 1 or statistics, but generalized to support any number of variables.
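Concretely, a linear regression models a response $y$ as a weighted sum of predictor variables plus an error term:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where the coefficients $\beta_0, \ldots, \beta_k$ are chosen to minimize the squared difference between the predicted and observed responses.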
We can then test, in our specific example, whether or not there is a correlation between the self-efficacy, race, and sex of a student, and whether or not that student believes that mathematics is a skill that can be taught to them. Unfortunately, I do not have enough time to run this analysis now, but one can look at this specific documentation page to get a peek at what kinds of linear regression modelling can be done.
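For readers who would like to try it themselves, here is a rough sketch of how such a model might be set up with statsmodels' formula interface. This is illustrative only: I have not run or validated it, and the assumption that negative codes mark missing values should be checked against the layout file.

import statsmodels.formula.api as smf

# Keep rows where the belief question was answered and the self-efficacy score is present
# (assumption: codes of -7 and below mark unanswered or suppressed items).
reg_db = db[(db["S4MLEARN"] > 0) & (db["X1MTHEFF"] > -7)]

# Model agreement that "math can be learned" as a function of math self-efficacy,
# with sex and race treated as categorical predictors.
model = smf.ols("S4MLEARN ~ X1MTHEFF + C(X1SEX) + C(X1RACE)", data=reg_db).fit()
print(model.summary())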
Although this tutorial does not get into the very complicated subject of machine learning, ML is still an extremely valuable tool to the data scientist.
Machine Learning, at its core, is the idea that one can create a model that will detect patterns in data and utilize those patterns to make predictions. These models can then be used to look at new data points and evaluate certain things about them. For example, suppose we have finished our hypothesis testing, and want to determine the following question:
Given certain characteristics about a student and their high school education, can we predict whether or not that student will graduate from college?
We could then decide on a specific set of variables that we think make for good training data. Since a good model is motivated with sound reasoning, we need to argue for our individual choices.
- X1SEX and X1RACE, for their immediate distinction in our exploratory phase. Do note that including these types of variables raises certain ethical questions for the data scientist: are you implementing bias into your machine learning model because it can distinguish one race from another? If you use your model as a predictor, will it bias against disadvantaged minorities? If someone asked to use this model to award an academic scholarship and it was trained on race, would you feel comfortable letting them use it?
- X1TXMTH, X2MEFFORT, or X2SEFFORT, to see whether the student's aptitude and diligence towards pursuing math and science has an effect on their completion rate in the future. Since these variables were not thoroughly examined in our exploratory analysis phase, we may have to go back and perform more analyses to see if there appears to be a correlation between them and college completion rate. Yet another example of the pipeline being more like an escape room!
- X3TGPAMAT, X3TGPASTEM, or another similar variable, which could provide insight into how high school GPA affects college retention rate. Of course, we should also worry that this data point could become self-reinforcing. If our model detects a strong pattern between graduation rate and STEM GPA in our training data, for example, then it may become overreliant on that single variable and guess graduation rate based on that variable alone.
- S1EDUEXPECT and S2SUREDIPL, to see whether a student's perception of their academic future influences their college retention. Note here that the ability to complete college is influenced by a variety of factors, including social, economic, and health reasons. Again, if someone asked you to use this model to determine a scholarship recipient, could you ensure that you weren't turning away the students who needed it most?

As you can hopefully see, the decision space of variables for a machine learning model, especially in an example such as this, has to be carefully curated before moving forward. Understand the framework in which you are using a machine learning model, what it will be used for, and why you are letting the model train on each individual variable.
You can read more about the kinds of questions scientists are asking about machine learning ethics in this article by Nature. Additionally, one can read more about the ways in which machine learning algorithms can and will discriminate if metrics are not carefully regulated.
Once one has decided to move forward with creating a machine learning model, there are a variety of tools available to the user. One of the most popular is scikit-learn, a library designed specifically for machine learning models. You can find documentation for scikit-learn at this website. It is very extensive--for the budding data scientist, I recommend looking at the tutorials.
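To make this concrete, here is a minimal illustrative sketch of what such a model might look like with scikit-learn. The feature choice, missing-value handling, and outcome definition below are all assumptions made for the example, not a vetted model:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# An example feature set drawn from the variables discussed above (purely illustrative).
features = ["X1SES", "X1TXMTH", "X2MEFFORT", "X2SEFFORT"]

# Keep rows with valid values (assumption: codes of -7 and below mark missing data)
# and define "success" as earning at least an associate's degree by the fifth survey.
ml_db = db[(db[features] > -7).all(axis=1) & (db["X5HIGHDEG"] >= 0)]
X = ml_db[features]
y = (ml_db["X5HIGHDEG"] >= 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))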
We have completed our analysis of the data, and I imagine that you, intrepid reader, have learned a thing or two about data science principles! Unless you are my TA, of course, in which case you almost certainly knew all of this stuff.
Quickly, let's recap what we learned about data science and our data along the way:
- We collected our data and loaded it with pandas, a numpy-based library for working with large data sets.
- We used the processing tools of pandas to get our database down to a human-workable format.
- We explored and visualized our data with matplotlib, which helped give us a visual perspective on our data and gave a clearer picture when it came to relative sizes and proportions.
- We performed hypothesis tests with scipy and statsmodels and tested various ideas regarding how sex and race play a role in education.

So, what have we learned? Well, it appears that race does have a significant effect on the graduation rates and STEM achievement of a student; I am willing to bet that, with more hypothesis testing, we could draw the same conclusions about sex and, if we had the data, gender identity.
So, where do we go from here? We've completed our trip through the data science pipeline, but there's still more to be done. If you are particularly interested, you can look into doing further analysis on this data, or perhaps gather more data to help inform policy solutions. The world is your oyster at this point. Data science is primarily a tool that can help you, rather than being its own isolated science--it is interdisciplinary by nature!
And hopefully you, future data scientist, have learned much about what data science is, and how to do it! I look forward to seeing your own project soon enough.
### Written by: Jakob Wachter
### For: CMSC320, Introduction to Data Science
### Thanks to the TA who graded this :)