How Long Should My Dissertation Be?

First off, check out this great blog post from “R is my friend”:

https://beckmw.wordpress.com/2013/04/15/how-long-is-the-average-dissertation/

This blog post shows how to web scrape a university library website to collect the page lengths of all dissertations and analyze them using R.  Isn’t that nifty?  Now, wouldn’t you like to see it again – in Python?

I came across this post while working on my dissertation.  I wanted to use it to examine the length of papers at my own institution.  Unfortunately, while the University of Minnesota’s Digital Conservancy recorded the number of pages for each dissertation, my university’s library system did not; my institution uses a system called Digital Commons instead.  Fortunately, Digital Commons is similar enough to the U of M’s system that it can still be scraped, and the page count can be obtained by downloading and reading each pdf document.

Rather than adapt the R code, I opted to use Python.  There are lots of ways to do web scraping and data analysis; I could have used PowerShell for the scraping and SPSS for the analysis.  However, R and Python are the go-to statistical programming languages, and since I used Python in my dissertation, I wanted to stick with what I know.

Part One – Web Scraping

Finding Digital Commons institutions is a simple matter of Google searching “site:dc.*.edu”.  From there, the URLs follow a typical format of https://dc.university.edu/collection/1/.  Some institutions divide their collections up by college or degree, but a small set of schools do not: all of the master’s theses and doctoral dissertations are lumped together in one collection.  Fortunately, my institution is one of those schools.  Determining the range of the collection is a simple URL manipulation task: start at a low document ID and adjust the range until you find where the collection’s document pages begin and end.
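
For example, here is a rough sketch of how you might probe for the end of a collection, assuming (and this is an assumption on my part) that an ID past the end of the collection comes back as an HTTP error and that the IDs are contiguous.  The base URL is the same placeholder pattern as above.

import urllib2

BASEURL = "http://dc.university.edu/collection/"

#walk the document IDs until one stops resolving
#assumes missing IDs return an HTTP error and that IDs are contiguous
last_good = 0
for i in range(1, 2000):
    try:
        urllib2.urlopen(BASEURL + str(i) + "/").close()
        last_good = i
    except urllib2.HTTPError:
        break

print "collection appears to end around ID " + str(last_good)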

Each page has a standard set of meta tags and paragraphs which contain valuable data you might want.  I pulled paper ID, keywords, department, institution, degree, author name, committee chair, degree award date, paper title, degree award year, the URL for the pdf document, the URL for the current (landing) page, and the date the paper was put online.  I pulled more than I needed for this project, but the data might be useful for future analyses.
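
If you want to see which meta tags a landing page exposes before settling on a field list, a quick sketch like this will print every meta tag name on a page (the URL is a placeholder; swap in a real landing page from your collection):

from bs4 import BeautifulSoup
import urllib2

#placeholder URL - substitute a real landing page from your collection
page = urllib2.urlopen("http://dc.university.edu/collection/1/").read()
soup = BeautifulSoup(page)

#most of the citation data lives in meta tags named bepress_citation_*
for tag in soup.find_all("meta"):
    if tag.has_attr("name"):
        print tag["name"]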

The URL for the pdf document is important for two reasons.  First, the paper ID from the landing URL is not necessarily the same as the article ID in the URL of the pdf document.  You might visit https://dc.university.edu/collection/5/ to collect details about the paper, but the pdf URL could be https://dc.university.edu/cgi/viewcontent.cgi?article=35&context=collection.  You need the pdf URL so you can link the landing page and pdf together.  Second, the pdf URL is how you’re going to get the page count: you’ll have to download the pdf, read it, and pull out the number of pages.  This means you need about 5-10 GB of free space to hold all the pdfs, or you could change my code to delete each pdf after you’ve gotten the page count from it.  I don’t like having a long-running program deleting files, and I like having all the pdfs anyway, so my code keeps them around.  And make no mistake: since you’re downloading pdfs, this code is going to run for a long time.  I’d recommend kicking it off overnight and checking it in the morning.
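
If you would rather not keep the pdfs, the change is small: read the page count, then remove the file.  Here is a minimal sketch of that variation (not what my code below does; mine keeps the files):

import os
import PyPDF2

def count_pages_and_delete(pdf_path):
    #read the page count from a saved pdf, then delete the file to save space
    with open(pdf_path, "rb") as pdf_file:
        num_pages = PyPDF2.PdfFileReader(pdf_file).numPages
    os.remove(pdf_path)
    return num_pages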

Some pdfs are embargoed, either because the author wants to publish articles based on the work or because the work contains sensitive information.  When my code encounters one of those pages, it prints an error message to the console.  I could have redirected that to an error file, but I didn’t run into many of them, so I didn’t invest the two minutes it would take to write one.
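
For what it’s worth, those two minutes would look something like this: open a log file next to the csv output, write the skipped ID inside the except block, and close it at the end.  A sketch of the changes to the script below (the log filename is my own invention; savedir and iCount are the variables from that script):

#open once near the top of the script, alongside the csv output file
errorFile = open(savedir + "scrape_errors.log", "w+")

#inside the except block, log the skipped paper ID instead of printing it
errorFile.write("error - skip this one... " + str(iCount) + "\n")

#close it at the end, alongside the output file
errorFile.close()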

So here is the Python code for the web scraping:

from bs4 import BeautifulSoup
import urllib2, PyPDF2, cgi, os

savedir = "C:\\pdf_project\\university\\"
BASEURL = "http://dc.university.edu/stads/"

print "create output file"
outputFile= open(savedir + "dissertation_thesis_info.university.csv","w+")
outputFile.write("PaperID, Keywords, Department, Institution, Degree, Author, Chair, Award_Date, Title, Year, " +
" PDF_URL, Landing_URL, Online_Date, PDF_Saved_Filename, PDF_File_Size_KB, PDF_Num_Pages\n")

def get_meta_value(bs, name_str):
    #return the content attribute of the named meta tag, or '' if it is missing
    return_val = ''
    for sval in bs.find_all("meta", attrs={'name':name_str}):
        if sval.has_attr('content'):
            return_val=sval.attrs['content']
    return return_val

def get_paragraph_value(bs, name_str):
    #return the text of the paragraph inside the named div, or '' if it is missing
    return_val = ''
    for sval in bs.find_all("div", attrs={'id':name_str}):
        for subval in sval.find_all('p'):
            return_val=subval.next
    return return_val

#The collection IDs for University run from 1 to 400
for iCount in range(1, 401):
    #reset all fields for this paper
    keywords=institution=degree=author=title=paper_date=pdf_url=landing_url=online_date=department=chair=award_date=''

    r=urllib2.urlopen(BASEURL + str(iCount)).read()
    soup=BeautifulSoup(r)

    keywords=get_meta_value(soup, 'keywords')
    institution=get_meta_value(soup, 'bepress_citation_dissertation_institution')
    degree=get_meta_value(soup, 'bepress_citation_dissertation_name')
    author=get_meta_value(soup, 'bepress_citation_author')
    title=get_meta_value(soup, 'bepress_citation_title')
    paper_date=get_meta_value(soup, 'bepress_citation_date')
    pdf_url=get_meta_value(soup, 'bepress_citation_pdf_url')
    landing_url=get_meta_value(soup, 'bepress_citation_abstract_html_url')
    online_date=get_meta_value(soup, 'bepress_citation_online_date')
    department=get_paragraph_value(soup, 'department')
    chair=get_paragraph_value(soup, 'advisor1')
    award_date=get_paragraph_value(soup, 'publication_date')

    if pdf_url:
        try:
            remotefile = urllib2.urlopen(pdf_url)
            header = remotefile.info()['Content-Disposition']
            value, params = cgi.parse_header(header)
            filename = os.path.splitext(params['filename'])[0]

            data = remotefile.read()
            pdf_size = len(data) / 1024    #size in KB

            with open(savedir + str(iCount) + ".pdf", "wb") as pdfFile:
                pdfFile.write(data)

            pdfFileName=str(iCount) + ".pdf"

            #read the page count from the saved pdf
            with open(savedir + pdfFileName, 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                pdf_pages = str(pdfReader.numPages)

            outputFile.write(str(iCount).encode('utf-8') + ",\"" + keywords.encode('utf-8') + "\",\"" + department.encode('utf-8') + "\",\"" +
                             institution.encode('utf-8') + "\",\"" + degree.encode('utf-8') + "\",\"" + author.encode('utf-8') + "\",\"" +
                             str(chair).encode('utf-8') + "\",\"" + str(award_date).encode('utf-8') + "\",\"" +
                             title.encode('utf-8') + "\",\"" +  paper_date.encode('utf-8') + "\",\"" + pdf_url.encode('utf-8') + "\",\"" +
                             landing_url.encode('utf-8') + "\",\"" + online_date.encode('utf-8') + "\",\"" + pdfFileName.encode('utf-8') + "\",\"" +
                             str(pdf_size).encode('utf-8') + "\",\"" + str(pdf_pages).encode('utf-8') + "\"\n")
        except Exception:
            print "error - skip this one... " + str(iCount)

print "close output file"
outputFile.close()

I used BeautifulSoup for the web scraping and PyPDF2 for reading the number of pages from the saved pdf documents.  While I was at it, I grabbed the filename of the pdf and the size of the document.  I had a hypothesis that over time students were adding more images to their papers, padding out their paper sizes without the pesky need for actual writing.  Initial data exploration did not bear this out though.  Still, there might be other ways that file size could be used in the analysis phase.
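
One crude way to chase that idea with the columns already in the csv is to look at kilobytes per page over time; a heavier ratio in later years would hint at more image padding.  A quick sketch using the same csv the scraper writes (treat it as an idea, not a finished analysis):

import pandas as pd

df = pd.read_csv('C:\\pdf_project\\university\\dissertation_thesis_info.university.csv')

#kilobytes per page as a rough proxy for how image-heavy a paper is
df['KB_Per_Page'] = df[' PDF_File_Size_KB'] / df[' PDF_Num_Pages']

#median KB per page by award year - a steady climb would support the padding hypothesis
print df.groupby(' Year')['KB_Per_Page'].median()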

Part Two – Analysis

Now that I had a nice csv file with several data points, including the number of pages for each pdf, the next step was to analyze the data.  I had papers from over 40 departments and 16 years.  To be honest, pivot tables and charts in Excel work great for exploratory data analysis.  However, I liked the histograms and box plots from the blog post that inspired this one, so I tried to code those in Python.  I really liked how the graphs in R showed descriptive statistics, including sample size and median.  It’s easy to look at a box and whisker plot without those descriptive statistics and come to some poor conclusions.  I’ll get into that a little further down.

I chose to limit the dataset to papers from 2010 onward with 500 pages or fewer.  There were only two papers with more than 500 pages, and including them slightly messed up the box and whisker graphs.  I limited the set to 2010 and beyond to capture the coveted six-year graduation rate.  The six-year rate applies better to undergraduate programs, but it provides a convenient cutoff for what I’m currently looking at.  The final set of papers numbered 1,177.

I really liked the box and whisker plots, so I started with those.  Below you will see what Python can do with the Seaborn library.  First is the box and whisker plot sorted by department and degree.  Second is the same set sorted by median page length.

[Figure: box and whisker plot of page counts by department and degree]

[Figure: the same box and whisker plot, ordered by median page length]

It looks like Environmental Health is crushing it when it comes to writing.  This is where we return to the usefulness of descriptive statistics: even though Environmental Health PhDs are far outpacing everyone else, it turns out there have only been five of them since 2010.  The one that I really like is Computer and Information Science.  The range is really broad, going from below 50 pages to over 300, yet the median turns out to be 86 pages.  So what’s going on there?  I have a master’s from that department, although I opted for the capstone rather than a thesis, so I have some context from classmates and alumni who did go the thesis route.  It’s a simple explanation: the high page counts come from folks putting their code in the appendix.

Department count mean std min 25% 50% 75% max
Allied Health 25 80.12 30.10 49 62.00 75.0 87.00 174
Appalachian Studies 5 106.00 25.76 77 96.00 98.0 113.00 146
Art 35 53.66 20.93 21 43.50 52.0 62.50 137
Biology 98 87.09 38.33 31 61.00 76.0 107.00 253
Biomedical Sciences 35 145.86 56.79 70 113.00 136.0 165.50 323
Chemistry 66 92.33 50.10 43 62.00 72.0 115.25 286
Clinical Nutrition 9 76.44 17.31 48 67.00 75.0 89.00 99
Communication, Professional 47 70.28 25.66 26 54.00 64.0 90.50 138
Communicative Disorders 9 83.44 22.43 45 77.00 82.0 92.00 119
Computer and Information Science 13 125.69 79.72 46 71.00 86.0 170.00 305
Counseling 1 96.00 NaN 96 96.00 96.0 96.00 96
Criminal Justice and Criminology 38 88.47 35.09 43 68.00 82.0 97.25 192
Early Childhood Education 14 174.21 74.89 58 130.75 172.5 199.75 366
Educational Leadership 292 121.93 36.93 66 98.00 111.5 134.50 351
English 29 82.52 40.68 49 60.00 67.0 85.00 206
Environmental Health 5 183.00 128.91 89 95.00 127.0 206.00 398
Geosciences 33 120.97 53.59 63 80.00 108.0 146.00 284
History 75 103.60 36.54 49 79.50 96.0 120.00 281
Kinesiology and Sport Studies 18 108.61 45.83 45 71.50 103.5 131.00 205
Liberal Studies 18 110.56 28.78 67 93.75 107.5 119.50 189
Mathematical Sciences 80 63.54 33.37 31 46.75 57.5 71.50 287
Nursing 24 150.54 51.85 83 111.75 136.5 183.75 286
Psychology 75 99.69 32.55 41 78.00 97.0 116.50 216
Public Health 28 135.04 51.84 52 108.25 129.5 164.50 245
Reading 9 116.56 48.63 69 87.00 91.0 133.00 223
Sociology 31 62.77 18.83 35 48.00 58.0 74.00 101
Special Education 6 67.33 13.49 50 56.00 70.5 78.25 81
Sport Physiology and Performance 27 133.26 35.58 53 107.50 133.0 150.50 210
Sports Science and Coach Education 3 161.00 112.53 83 96.50 110.0 200.00 290
Technology 29 83.24 37.17 49 60.00 74.0 94.00 220

This is where the inclusion of sample size and median on the R graphs is really helpful.  I was able to create the column charts and histograms, but not with those data points.  Given the flexibility of Python, I’m sure there is a way to produce these graphs with sample size, median, and an entertaining short story (I sketch one possible approach after the analysis code below).  However, I was going for speed and the least amount of complexity.  Here are the column charts and histograms that I was able to produce.

[Figure: Number of Pages Overall]

[Figure: Number of Pages by Selected Department]

[Figure: Awards by Month]

To summarize, most people at my institution graduate in May, dissertations run around 100 to 150 pages, and theses run between 50 and 100 pages.  Here is the code for this fun little project.

import pandas as pd
import seaborn as sns
from scipy.stats import norm

df = pd.read_csv('C:\\pdf_project\\university\\dissertation_thesis_info.university.csv')

#the scraper writes the csv header with ", " separators, so most column names carry a leading space (e.g. " Department")
df["DeptDeg"] = df[" Department"] + " - " + df[" Degree"]
df['Month'] = df[" Award_Date"].apply(lambda x: x.split('-')[1])

df = df[(df[' Year'] >= 2010) & (df[' PDF_Num_Pages'] <= 500)]
df_temp = df[df[' Department'].isin(['Environmental Health', 'Educational Leadership', 'Computer and Information Science', 'Nursing'])]

#group sizes per department/degree combo, in alphabetical order for the first box plot
mo = df.groupby(by=["DeptDeg"])["DeptDeg"].size()
mo = mo.sort_index(ascending=True)

# get descriptive statistics
print df[' PDF_Num_Pages'].describe()

grouped_data = df.groupby([' Department'])
print grouped_data[' PDF_Num_Pages'].describe().unstack()

#display overall histogram
his_pages = sns.distplot(df[' PDF_Num_Pages'], fit=norm, kde=False)
sns.plt.show()

#display histograms for a selection of departments
g = sns.FacetGrid(df_temp, col=' Department', col_wrap=2)
g.map(sns.distplot, ' PDF_Num_Pages');
sns.plt.show()

#display a column chart for when degrees are awarded
cp_month = sns.countplot(x="Month", data=df)
sns.plt.show()

#display a box and whisker plot, ordered by department and degree
sns.plt.figure(figsize=(10,10))
sns.boxplot(x=' PDF_Num_Pages', y='DeptDeg', data=df, order=mo.index)
sns.plt.subplots_adjust(left=0.25)
sns.plt.show()

#display a box and whisker plot, ordered by median
mo = df.groupby(by=["DeptDeg"])[" PDF_Num_Pages"].median()
mo.sort(ascending=False)
sns.plt.figure(figsize=(10,10))
sns.boxplot(x=' PDF_Num_Pages', y='DeptDeg', data=df, order=mo.index)
sns.plt.subplots_adjust(left=0.25)
sns.plt.show()
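
As promised earlier, here is a rough sketch of how you might annotate the box plot with sample size and median, like the R graphs that inspired this post.  It reuses the df, sns, and mo objects from the script above, and I have not polished it against the exact figures shown here, so treat it as a starting point rather than a finished plot.

import matplotlib.pyplot as plt

#counts and medians per department/degree combo, keyed the same way as the plot order
counts = df.groupby("DeptDeg")[" PDF_Num_Pages"].size()
medians = df.groupby("DeptDeg")[" PDF_Num_Pages"].median()

plt.figure(figsize=(10,10))
ax = sns.boxplot(x=' PDF_Num_Pages', y='DeptDeg', data=df, order=mo.index)

#write "n=..., md=..." just to the right of each box (categories sit at y = 0, 1, 2, ...)
for i, dept in enumerate(mo.index):
    ax.text(medians[dept] + 5, i, "n=%d, md=%.0f" % (counts[dept], medians[dept]),
            va='center', fontsize=8)

plt.subplots_adjust(left=0.25)
plt.show()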

Recommendations for Future Research

There are numerous research opportunities with this type of data.  I’ve found a small handful of other institutions that use Digital Commons in the same way as my university, so I could compare the number of pages per dissertation and thesis across institutions and combine that data with IPEDS data.  I could do a time series analysis of just the papers from my institution to determine whether the number of pages has been decreasing over time.  I could focus on a particular department.  I could further explore the hypothesis about padding papers out with graphics.

I did not explore PyPDF2 fully, but maybe it could be used to catalog the table of contents of each paper.  These papers typically have five chapters and a set of references.  If I could grab the page numbers of the chapters from the table of contents, I could analyze how the pages are distributed within dissertations and theses, which would go a long way toward teasing out how much of a paper is actual research versus padding from appendices and graphics.
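
If the pdfs have bookmarks that mirror the table of contents, PyPDF2’s getOutlines() can pull the titles out.  Here is a speculative sketch of what that might look like; many pdfs will have no bookmarks at all, so this is a starting point rather than a working cataloger, and the filename is a placeholder.

import PyPDF2

def print_bookmarks(pdf_path):
    #print the top-level bookmark titles, which often mirror the chapter headings
    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        for item in reader.getOutlines():
            if isinstance(item, list):   #skip nested sub-sections
                continue
            print item.title

print_bookmarks("C:\\pdf_project\\university\\5.pdf")   #placeholder filename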

I hope you have enjoyed this post.  I hope it helps you with your own research projects.  If you want to chat, drop me a note.  I’d love to hear your ideas on the subject.  Best wishes!

– Dr. J
