💻 Project #3: Exploratory Data Analysis

Project Overview

This is a cumulative project focusing on Exploratory Data Analysis (EDA) using Python’s pandas, matplotlib, and seaborn libraries. The project aims to demonstrate skills in data wrangling, visualization, and communication of findings through a scientific conference-style poster. You will work with a large, real-world dataset of your choice to perform EDA.

The long-term project consists of THREE main deliverables:

  1. A descriptive markdown text file
  2. A Python script
  3. An academic conference-style poster

    📰 Make a copy of this Google Slides template for the poster!

💻 PROJECT PROGRAM SETUP INSTRUCTIONS
  1. Go to the public template repository for our class: BWL-CS Python Template
  2. Click the Use this template button above the list of files, then select Create a new repository
  3. Specify the repository name: CS3-Project-EDA
  4. Click Create repository

    Now you have your own personal copy of this starter code that you can always access under the Your repositories section of GitHub! 📂

  5. In your new repository, click the Code button and select the Codespaces tab
  6. Click Create Codespace on main and wait for the environment to load, then you’re ready to code!

🛑 When class ends, don’t forget to SAVE YOUR WORK! Codespaces are TEMPORARY editing environments, so you need to COMMIT changes properly in order to update the main repository for your program.

There are multiple steps to saving in GitHub Codespaces:

  1. Navigate to the Source Control menu on the LEFT sidebar
  2. Click the Commit button on the LEFT menu
  3. Type a brief commit message at the top of the file that opens, for example: updated main.py
  4. Click the small ✔️ checkmark in the TOP RIGHT corner
  5. Click the Sync Changes button on the LEFT menu
  6. Finally you can close your Codespace!

Instructions & Requirements

Dataset Sources:

① Markdown file:

  • Dataset:
    • Provide a link to the dataset you chose.

      In markdown syntax, you can format links like this: [Link text](Link address)

  • Column Descriptions:
    • Write a brief description of each column in your dataset. Include data types and potential values.
  • Hypotheses/Questions:
    • List at least 5 questions you want to explore using your dataset.

      Example questions: Is there a correlation between two variables? How do values in a specific column change over time or across categories?

  • Visualization Plan:
    • Explain how you will use visualizations to test your hypotheses. Include the type of chart you plan to use for each question.
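
Before committing to a question, a quick numeric check can show whether the dataset is able to answer it. The following is a minimal sketch, assuming a toy DataFrame; the column names (`year`, `region`, `temperature`, `sales`) and values are hypothetical placeholders, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data; replace with your own dataset and column names.
df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2022, 2022],
    "region": ["North", "South", "North", "South", "North", "South"],
    "temperature": [10.0, 20.0, 11.0, 21.0, 12.0, 22.0],
    "sales": [100.0, 200.0, 110.0, 210.0, 120.0, 220.0],
})

# Question type 1: is there a correlation between two numeric variables?
# Series.corr computes the Pearson correlation by default.
corr = df["temperature"].corr(df["sales"])

# Question type 2: how do values change over time or across categories?
mean_by_year = df.groupby("year")["sales"].mean()
mean_by_region = df.groupby("region")["sales"].mean()

print(f"temperature vs. sales correlation: {corr:.2f}")
print(mean_by_year)
print(mean_by_region)
```

Checks like these are only a starting point; the visualizations are still what you use to test and present each hypothesis.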

② Python script:

Your main.py script should be well-organized and demonstrate that you performed meaningful data analysis and generated diverse visualizations.

  • Load & process the dataset
    • Handle missing values.
    • Rename columns if necessary for clarity.
    • Convert data types if needed.
    • Filter and/or group data for focused analysis.
  • Contain code for at least 4 different types of visualizations
    • Ensure each visualization is well-labeled with titles, axis labels, and legends.
    • Visualizations should be clear, informative, and appropriate for the data (see the guidelines below).

      See the Example Graph Gallery for inspiration.

  • Be thoroughly commented
    • Explain the purpose of each section of the code.
    • Use functions for reusable components of your code.
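
The load-and-process requirements above can be sketched as a pair of small, commented functions. This is an illustration under stated assumptions, not a required structure: the inline CSV, the column names, and the helper names `load_and_clean` and `mean_temp_by_city` are all hypothetical, and the choice to treat missing rainfall as 0 is one that you would need to justify for your own data.

```python
import io

import pandas as pd

# Hypothetical inline CSV so the sketch runs on its own;
# in main.py you would pass a file path to pd.read_csv instead.
RAW_CSV = """Date,Cty,Temp (F),Rainfall
2024-01-01,Boston,31.5,0.2
2024-01-02,Boston,,0.0
2024-01-01,Miami,75.1,
2024-01-02,Miami,77.3,0.4
"""


def load_and_clean(source):
    """Load a CSV, then rename columns, convert types, and handle missing values."""
    df = pd.read_csv(source)

    # Rename columns for clarity.
    df = df.rename(columns={
        "Date": "date",
        "Cty": "city",
        "Temp (F)": "temp_f",
        "Rainfall": "rainfall_in",
    })

    # Convert data types: parse date strings into datetime values.
    df["date"] = pd.to_datetime(df["date"])

    # Handle missing values: drop rows without a temperature,
    # and treat missing rainfall as 0 (an assumption for this toy data).
    df = df.dropna(subset=["temp_f"])
    df["rainfall_in"] = df["rainfall_in"].fillna(0.0)

    return df


def mean_temp_by_city(df):
    """Group by city and compute the mean temperature."""
    return df.groupby("city")["temp_f"].mean()


df = load_and_clean(io.StringIO(RAW_CSV))
summary = mean_temp_by_city(df)
print(summary)
```

Keeping each step in a named function makes it easier to explain your cleaning process in the poster's Methods section.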

③ Poster:

A scientific poster for an academic conference must effectively communicate key findings. All captions and text should be concise and relevant.

  • Introduction
    • Context for the dataset and why it is interesting/relevant.
    • Clearly state your research questions or hypotheses.
  • Methods
    • Describe your process for cleaning and analyzing the data.
    • Include screenshots of key Python code snippets.
  • Results
    • Present 2–3 visualizations from your analysis.
    • Provide clear captions for each figure.
  • Discussion
    • Interpret your findings and discuss patterns or trends observed in the data.
    • Mention any limitations of your analysis.
  • Conclusions
    • Summarize key takeaways.
    • Suggest potential areas for future research or data exploration.
  • References
    • Cite the dataset and any external resources.

      📚 Conduct research on the topic to enhance your poster’s Introduction and/or Discussion sections!


Choosing Appropriate Visualizations

💡 Choosing the right type of visualization is crucial for effectively communicating your findings. Below are some guidelines to help you decide. Check out this resource from UC Berkeley for additional tips.

Example Chart Selection Table

| Question Type | Recommended Chart Type |
| --- | --- |
| Proportions within a whole | Pie chart, Stacked bar plot, Word cloud |
| Trends over time, sequential events | Line plot, Animated plots |
| Distribution of a variable | Histogram, Box plot (comparing between groups) |
| Comparison across categories | Bar plot, Grouped bar plot (catplot) |
| Relationships between variables | Scatter plot (two variables), Heatmap (multiple variables) |
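
To make the chart selection concrete, here is a minimal sketch that draws four of the chart types above with matplotlib, each with a title and axis labels. The toy data, column names, and output filename `chart_types.png` are hypothetical; seaborn offers comparable plotting functions if you prefer its styling.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: saves to a file instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for your cleaned dataset.
df = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023] * 2,
    "region": ["North"] * 4 + ["South"] * 4,
    "temperature": [10.0, 11.5, 12.0, 13.2, 20.0, 21.1, 22.4, 22.9],
    "sales": [100, 118, 121, 140, 205, 210, 230, 228],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Trend over time: line plot, one line per region.
for region, group in df.groupby("region"):
    axes[0, 0].plot(group["year"], group["sales"], marker="o", label=region)
axes[0, 0].set(title="Sales over time", xlabel="Year", ylabel="Sales")
axes[0, 0].set_xticks(sorted(df["year"].unique()))
axes[0, 0].legend(title="Region")

# Distribution of a variable: histogram.
axes[0, 1].hist(df["temperature"], bins=5)
axes[0, 1].set(title="Temperature distribution", xlabel="Temperature", ylabel="Count")

# Comparison across categories: bar plot of group means.
means = df.groupby("region")["sales"].mean()
axes[1, 0].bar(means.index, means.values)
axes[1, 0].set(title="Mean sales by region", xlabel="Region", ylabel="Mean sales")

# Relationship between two variables: scatter plot.
axes[1, 1].scatter(df["temperature"], df["sales"])
axes[1, 1].set(title="Sales vs. temperature", xlabel="Temperature", ylabel="Sales")

fig.tight_layout()
fig.savefig("chart_types.png")
```

Each panel maps to one row of the table: line plot for trends, histogram for distributions, bar plot for category comparisons, and scatter plot for relationships.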
