Into the Weeds

Ongoing Learning and Project Clarity

Since Post 1, my time investment in this project has varied between learning data science at a high level and expanding my Python foundation through tutorials and project work. I have been working my way through the book, Data Science for Business, which nicely balances emphasis between high level discussion and gritty nuance. Importantly, the text provides language for speaking to my interests and experiences in analytics and defining my project stages and scope.

The analysis I’m conducting involves profiling and causal modeling. I want to know which data points correlate most strongly to student outcomes and AmeriCorps Member performance (see my previous post for information on my role at an educational non-profit). I am hoping I can then build a profile of successful AmeriCorps Members (ACMs) and use this model to recommend metrics and goals that better align to student outcomes and ACM performance. Additionally, this analysis could help advocate to school partners for the school settings in which our AmeriCorps Members best serve students’ individualistic needs.

On the technical Python side, I have gained further experience with tuples, dictionaries, sets, and with control flows while also practicing other basic types and pandas techniques I already knew well. The tutorials I followed include an Exploratory Data Analysis from PyCon, the seaborn data visualization tutorial, and a little natural language processing.

Although these tutorials provided a sense of potential and accomplishment, the feeling quickly faded in the absence of application. Of course, this is where the ‘hard learning’ occurs and meaningful progress is made. Following are two examples of the data I have processed in my project to date.

Getting Data

First I needed to get my data into Python. Most of of my data are stored on Salesforce or Excel workbooks through SharePoint. Going into this project, I had already developed Python systems for getting and writing Excel documents to SharePoint. This was accomplished with a mapped network drive to the SharePoint server, allowing me to simply reference and navigate the file structure as a local drive.

The Salesforce component was a bit more tricky, but with a little Python elbow grease, the simple-salesforce package was more than enough. This package allows me to query, create, and delete records, which has had a profound effect on my work practices. In a typical use of simple-salesforce, the user passes a traditional SOQL query into the Salesforce instance (in this case cysh) and performs the query_all function. Here is the set-up:

And a simple query of assessment records, for example:

In my case, I’m working with datasets that are not very large. In the Chicago region, we serve fewer than 3000 students and enlist fewer than 300 AmeriCorps Members. Therefore I can query the entire Salesforce object (i.e. Assesment__c) with whatever fields I’m interested in and perform all the shaping and filtering in Python.

I added this convenient function to iterate over the query response and shape it as a dataframe:

Usage of this function:

In terms of assessment data, the next step will be determining student assessment progress by comparing prior year or start of year assessments to the recent winter assessment. Typically we calculate assessment goals from fall to spring, but in this analysis I need to calculate fall to winter goals. Ideally this analysis will also consider data from past years, but that may be an unreasonable lift. To view my current progress on assessment data, head over to the IPython notebook in my CityYear-ACM-Analysis repo.

Cleaning Data

In my organization we employ four coaches who observe, provide feedback, and enter data for each AmeriCorps member at least once per month. If this data were cleaned and rolled together, I could determine whether the AmeriCorps Members who exercise best practices in tutoring are in fact more impactful with students (among many other interesting questions). The reason I put off this analysis is because the 26 Excel workbooks for tracking this data were not designed with aggregation in mind. The records are indexed by first names and nicknames, headers are inconsistent (and multi-indexed), and the file structure does not reflect standardized organization.

The first task was to identify the file paths to each Excel workbook. This was accomplished using os.walk() to list all contents of the relevant SharePoint directories. Since there are only 26 files, and the filenames are not patterned, I chose to simply copy-paste the file paths I wanted into a list.

The next challenge was that these Excel workbooks do not contain consistent sheets. All workbooks contain months as sheet names, but some coaches split their workbook into two semesters. Instead of pandas’ pd.read_excel(), I used pd.ExcelFile(). This allowed me to load the workbook once, then only pull the sheets whose names also occur in the set of sheets I’m interested in:

To this process, I also added columns for the month (the sheet name) and the school (the directory in which the file resides), but again the school folder names were not standardized.

Enter fuzzywuzzy, a very handy package for finding and scoring text matches. I performed fuzzy matches between the folder name columns and the official school names from Salesforce like so:

Using the same process, I was able to clean ACM names too. However, in this case there were several ambiguous matches, which I resolved by manually comparing names between the Excel workbook and the validated set of ACM names from Salesforce. I then fixed the mis-matched names in Excel by hand. You can also view this full process in the project repository.

Next Steps

Next in my analysis will be cleaning and combining 2-3 more data sources. Then I will conduct exploratory data analysis, cyclically chasing my branching curiosities as they come. For example, I will experiment with various definitions of ACM success to tentatively determine which features seem worthwhile to explore further.