Final project
Timeline
Research topics due Tuesday September 30 at 5:00pm
Project proposal due Wednesday October 15 at 5:00pm
Exploratory data analysis due Tuesday November 4
Project presentations (in lab) Wednesday November 12
Draft report due Tuesday November 18 at 5:00pm
Peer review Wednesday November 19, in lab
Final written report + reproducible GitHub repository due December 12 at 5:00pm.
Description
The goal of the final project is for you to use regression analysis to analyze a data set of your own choosing. The data set may already exist or you may collect your own data by scraping the web.
Choose the data based on your group’s interests or work you all have done in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set to analyze it in a meaningful way.
All analyses must be done in RStudio using Quarto and GitHub, and your analysis and written report must be reproducible.
Deliverables
You will work on the project with your lab groups. The primary deliverables for the project are
an in-person presentation about the exploratory data analysis and initial modeling
a written, reproducible final report detailing your analysis
a summary of your project highlights to share with the class
a GitHub repository containing all work from the project
There are intermediate milestones and peer review assignments throughout the semester to help you work towards the primary deliverables.
Step 1
Research topics
The goal of this milestone is to discuss topics and develop potential research questions your team is interested in investigating for the project. You are only developing ideas at this point; you do not need to have a data set identified right now.
Develop three potential research topics. Include the following for each topic:
- A brief description of the topic
- A statement about your motivation for investigating this topic
- The potential audience(s), i.e., who might be most interested in this research?
- Two or three potential research questions you could analyze about this topic. (Note: These are draft questions at this point. You will finalize the questions in the next stage of the project.)
- Ideas about the type of data you might use to answer this question or potential data sets you’re interested in using. (Note: The goal is to generate ideas at this point, so it is fine if you have not identified any particular data sets at this point.)
Turn-in: Write your responses in research-topics.qmd in your team’s project GitHub repo. Push the qmd and rendered pdf documents to GitHub by the deadline, Tuesday, September 30 at 5:00pm. There is no Gradescope submission.
Project proposal
The purpose of the project proposal is for your team to identify the data set you’re interested in analyzing to investigate one of your potential research topics. You will also do some preliminary exploration of the response variable and begin thinking about the modeling strategy. If you’re unsure where to find data, you can use the list of potential data sources here as a starting point.
You must use the data set(s) in the proposal for the final project, unless instructed otherwise when given feedback.
The data set must meet the following criteria:
At least 500 observations
At least 10 columns, such that at least 6 of the columns are useful and unique predictor variables.
e.g., identifier variables such as “name”, “ID number”, etc. are not useful predictor variables.
e.g., if you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.
At least one variable that can be identified as a reasonable response variable.
- The response variable can be quantitative or categorical.
A mix of quantitative and categorical variables that can be used as predictors.
May not be data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
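A quick way to screen a candidate data set against the size criteria above is sketched below. The helper name `check_data_criteria` and the simulated `example_df` are hypothetical and only for illustration. The check covers row count, column count, and the presence of both variable types; whether predictors are useful and unique still requires your own judgment.

```r
# Hypothetical helper (not part of the course materials): checks the
# size and variable-type criteria for a candidate data frame.
check_data_criteria <- function(df, min_rows = 500, min_cols = 10) {
  is_quant <- vapply(df, is.numeric, logical(1))
  is_categ <- vapply(
    df,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )
  c(
    enough_rows      = nrow(df) >= min_rows,
    enough_columns   = ncol(df) >= min_cols,
    has_quantitative = any(is_quant),
    has_categorical  = any(is_categ)
  )
}

# Simulated placeholder data: 600 rows, 10 columns (values are made up)
set.seed(1)
example_df <- data.frame(
  id = seq_len(600),
  group = sample(c("a", "b", "c"), 600, replace = TRUE),
  matrix(rnorm(600 * 8), nrow = 600)
)

check_data_criteria(example_df)
```

Any `FALSE` in the result flags a criterion to revisit before writing the proposal.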
The project proposal should include the following:
- A statement of a research question.
- Why your research question is important.
- A description of your data
- A description of how your data will be used to answer your research question
- A description of your planned analysis (what is the outcome variable, relevant predictors, regression technique, e.g. multiple linear regression for continuous outcome vs logistic regression for a binary outcome). We will cover logistic regression in the coming weeks but you can read ahead about it here.
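As a rough illustration of how the type of response variable drives the choice of regression technique, the sketch below fits both kinds of model in base R. It uses the built-in `mtcars` data purely as a placeholder (it would not meet the project criteria), and the choice of predictors is an assumption made for the example.

```r
# mtcars ships with base R and is used here only as a placeholder.

# Quantitative response (mpg): multiple linear regression
linear_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Binary response (am, coded 0/1): logistic regression
logistic_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficients: on the response scale for lm, on the log-odds scale for glm
coef(linear_fit)
coef(logistic_fit)
```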
Turn-in: submit a pdf proposal to Gradescope by the deadline.
Exploratory Data Analysis (EDA)
The purpose of this milestone is to explore the data early and get feedback on your data and analyses. You will submit a draft of the beginning of your report that includes the introduction and exploratory data analysis, with an emphasis on the EDA. It will also help you prepare for the presentation of the exploratory data analysis results.
Below is a brief description of the sections to include in this step:
Introduction
This section includes an introduction to the project motivation, background, data, and research question.
Exploratory Data Analysis
This section includes the following:
Description of the data set and key variables.
Exploratory data analysis of the response variable and key predictor variables. This includes visualizations, summary statistics, and narrative. Include:
- Univariate EDA of the response and key predictor variables.
- Bivariate EDA of the response and key predictor variables
- Potential interaction effects
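The three kinds of EDA listed above could be sketched as follows. This is a minimal base R example, with the built-in `mtcars` data standing in for your own data set; the variables chosen are assumptions for illustration only.

```r
# Placeholder data: mtcars (built into R), with mpg as the response.

# Univariate EDA: summary statistics and distribution of the response
summary(mtcars$mpg)
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Bivariate EDA: response vs. a quantitative and a categorical predictor
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
boxplot(mpg ~ factor(am), data = mtcars,
        xlab = "Transmission (0 = automatic, 1 = manual)",
        ylab = "Miles per gallon")

# Potential interaction: does the mpg vs. weight slope differ by transmission?
slopes_by_am <- sapply(
  split(mtcars, mtcars$am),
  function(d) coef(lm(mpg ~ wt, data = d))[["wt"]]
)
slopes_by_am
```

Noticeably different slopes across groups suggest an interaction term worth considering in the model.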
Turn-in: Write your draft introduction and exploratory data analysis in the report.qmd file in your team’s GitHub repo. Push the qmd and rendered pdf documents to GitHub by the deadline.
Because this question is often asked: how long should this section of the report be? Usually, the anticipated length of the EDA section, including all graphs, tables, narrative, etc. with code, warnings, and messages suppressed is about 3-5 pages. If you exceed the limit, format your figures to be smaller.
Step 2
Presentations
Your team will do an in-person presentation that summarizes and showcases the work you’ve done on the project thus far. Because the presentations will take place while you’re still working on the project, it will also be an opportunity to receive feedback and suggestions as well as provide feedback to other teams. The presentation will focus on introducing the subject matter and research question, showcase key results from the exploratory data analysis, and discuss primary modeling strategies and/or results. The presentation should be supported by slides that serve as a brief visual addition to the presentation. The presentation and slides will be graded for content and clarity.
You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!
The presentation is expected to be between 5 and 6 minutes. It may not exceed 6 minutes.
Every team member is expected to speak in the presentation. Part of the grade will be whether every team member had a meaningful speaking role in the presentation.
Suggested template:
Slides
The slide deck should have no more than 5 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
Title Slide
Slide 1: Introduce the subject, motivation, and research question
Slide 2: Introduce the data set
Slide 3 - 4: Highlights from the EDA (be sure to include EDA for the response variable!)
Slide 5: Initial modeling strategies / results (if applicable) / next steps and anything you’d like feedback on
Turn-in: Put a PDF of the slides in a folder titled “presentation” in your team’s GitHub repo. Push the slides before your lab section on the presentation day.
Step 3
Analysis + peer review
The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory data analysis, modeling, and initial interpretations.
Draft report
Write the draft in the report.qmd file in your project repo.
Below is a brief description of the sections to focus on in the draft:
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the body of the report, so focus on the EDA for the response variable and a few other interesting variables and relationships.
Methodology
This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, any variable transformations (if needed), and any other relevant considerations that were part of the model fitting process.
Results
In this section, you will output the final model and include a brief discussion of the model assumptions, diagnostics, and any relevant model fit statistics.
This section also includes initial interpretations and conclusions drawn from the model.
Grading
The draft will be graded based on whether there is demonstration of a reasonable attempt at each of the sections described above in the written report file in the GitHub repo by the deadline.
Peer review
The peer review is due on Friday, November 21 at 11:59pm.
Critically reviewing others’ work is a crucial part of the scientific process, and STA 221 is no exception. Each lab team will be assigned two other teams’ projects to review. Each team should push their draft to their GitHub repo by 5 pm the day before lab. The lab that week will be dedicated to the peer review, so your team will have time to review and provide quality feedback to two other teams.
During the peer review process, you will be provided read-only access to your partner teams’ GitHub repos. Provide your review in the form of GitHub issues to your partner team’s GitHub repo using the issue template provided in the repo.
Steps for peer review
Go to the Canvas announcement to see the teams you’re peer reviewing. You’ll spend about 30 minutes reviewing each project.
When you get to lab, you should have access to the GitHub repos for the teams you’re reviewing. In GitHub, search the repositories for “project”, and you should see the repos for the projects you’re reviewing. You will be able to read the files in the repo and post issues, but you cannot push changes to the repo. You will have access to the repo until the deadline for the peer review.
You may choose to all work on both peer reviews or have some team members focus on a single peer review. Either way, there will be one peer review grade assigned per team.
For each team you’re reviewing:
Open that team’s repo, read the project draft, and browse the rest of the repo.
Go to the Issues tab in that repo, click on New issue, and click on Get started for the Peer Review issue. Write your responses to the prompts in the issue. You will answer the following questions:
Describe the goal of the project.
Describe the data set used in the project. What are the observations in the data? What is the source of the data? How were the data originally collected?
Consider the exploratory data analysis (EDA). Describe one aspect of the EDA that is effective in helping you understand the data. Provide constructive feedback on how the team might improve the EDA.
Describe the statistical methods, analysis approach, and discussion of model assumptions, diagnostics, model fit.
Provide constructive feedback on how the team might improve their analysis. Make sure your feedback includes at least one comment on the statistical modeling aspect of the project, but also feel free to comment on aspects beyond the modeling.
Provide constructive feedback on the interpretations and initial conclusion. What is most effective in the presentation of the results? What additional detail can the team provide to make the results and conclusions easier for the reader to understand?
What aspect of this project are you most interested in and think would be interesting to highlight in the written report?
Provide constructive feedback on any issues with file and/or code organization.
(Optional) Any further comments or feedback?
Grading
The peer review will be graded on the extent to which each review comprehensively and constructively addresses the components on the peer review form. There will be one peer review grade per team.
Step 4
Written report
Your written report must be completed in the report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
Before you finalize your write up, make sure the code chunks are not visible and all messages and warnings are suppressed.
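One way to hide code and suppress messages and warnings for the whole document is through the execution options in the YAML header of report.qmd. The fragment below is a sketch; the title is a placeholder, and your header may contain other fields.

```yaml
---
title: "Project title"   # placeholder
format: pdf
execute:
  echo: false      # hide code chunks in the rendered report
  warning: false   # suppress warnings
  message: false   # suppress messages
---
```

Setting these options once in the header avoids repeating them in every code chunk.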
You will submit the PDF of your final report on GitHub.
The PDF you submit must match the .qmd in your GitHub repository exactly. The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including tables and visualizations, must be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all parts of the analysis in the report.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will overwhelmingly be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.
Grading criteria
The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.). The exploratory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.
Methodology
This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.
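As one possible illustration of this process, the sketch below compares a main-effects model with a model that adds an interaction, then draws a basic residual plot. It uses the built-in `mtcars` data as a placeholder, and the nested F-test and AIC are just two of several reasonable selection criteria; use whichever approach fits your analysis and the course content.

```r
# Placeholder data and predictors, for illustration only.
main_fit <- lm(mpg ~ wt + hp, data = mtcars)
interaction_fit <- lm(mpg ~ wt * hp, data = mtcars)

# Nested F-test: does adding the wt:hp interaction improve the fit?
comparison <- anova(main_fit, interaction_fit)
comparison

# AIC for each candidate model (lower is better)
AIC(main_fit, interaction_fit)

# Diagnostics: residuals vs. fitted values for the larger model
plot(fitted(interaction_fit), resid(interaction_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```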
Grading criteria
The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to select the variables in the final model; the approach is clearly described in the report. The model selection process took into account potential interaction effects and addressed any violations in model conditions. If violations of model conditions are still present, there was a reasonable attempt to address the violations based on the course content.
Results
This is where you will output and discuss the final model.
Describe the key results from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
Grading criteria
The model fit is clearly assessed, and interesting findings from the model are clearly described. The model conditions and diagnostics are thoroughly and accurately assessed for the final model, if not previously discussed in the methodology. Interpretations of model coefficients are used to support the key findings and conclusions, rather than merely listing the interpretation of every model coefficient. If the primary modeling objective is prediction, the model’s predictive power is thoroughly assessed.
Discussion + Conclusion
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Grading criteria
Overall conclusions from analysis are clearly described, and the model results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.
Organization + formatting
This is an assessment of the overall presentation and formatting of the written report.
Grading criteria
The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted and labeled. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.
Reproducible GitHub repo
All written work (with the exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
The GitHub repo should have the following structure:
- README: Project title, team name, and a short project summary
- report.qmd & report.pdf: Final written report
- proposal.qmd & proposal.pdf: Project proposal
- research-topics.qmd & research-topics.pdf: Proposed research questions
- /data: Folder that contains all data used for the final project
- /data/README.md: Data dictionary and source for data set
- project.Rproj: File specifying the RStudio project
- /presentation: Folder with the presentation slides or link to slides
- .gitignore: File that lists all files that are in the local RStudio project but not the GitHub repo
- /.github: Folder for peer review issue template
Any other files should be neatly organized into clearly labeled folders.
Update the README of the project repo with your project title and team members’ names.
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.