💻 Project #3: Exploratory Data Analysis
Project Overview
This is a cumulative project focusing on Exploratory Data Analysis (EDA) using Python's pandas, matplotlib, and seaborn libraries. The project aims to demonstrate skills in data wrangling, visualization, and communication of findings through a scientific conference-style poster. You will work with a large, real-world dataset of your choice to perform EDA.
The long-term project consists of THREE main deliverables:
- A descriptive markdown text file
- A Python script
- An academic conference-style poster
Make a copy of this Google Slide Template for the poster!
💻 PROJECT PROGRAM SETUP INSTRUCTIONS
- Go to the public template repository for our class: BWL-CS Python Template
- Click the button above the list of files, then select `Create a new repository`
- Specify the repository name: `CS3-Project-EDA`
- Click the button to create the repository
- Now you have your own personal copy of this starter code that you can always access under the `Your repositories` section of GitHub!
- Now on your repository, click and select the `Codespaces` tab
- Click `Create Codespace on main` and wait for the environment to load, then you're ready to code!
When class ends, don't forget to SAVE YOUR WORK! Codespaces are TEMPORARY editing environments, so you need to COMMIT changes properly in order to update the main repository for your program.
There are multiple steps to saving in GitHub Codespaces:
- Navigate to the `Source Control` menu on the LEFT sidebar
- Click the button on the LEFT menu
- Type a brief commit message at the top of the file that opens, for example: `updated main.py`
- Click the small ✔️ checkmark in the TOP RIGHT corner
- Click the button on the LEFT menu
- Finally, you can close your Codespace!
Instructions & Requirements
Dataset Sources:
① Markdown file:
- Dataset:
- Provide a link to the dataset you chose.
In `markdown` syntax, you can format links like this: `[Link text](Link address)`
- Column Descriptions:
- Write a brief description of each column in your dataset. Include data types and potential values.
- Hypotheses/Questions:
- List at least 5 questions you want to explore using your dataset.
Examples: Is there a correlation between two variables? How do values in a specific column change over time or across categories?
- Visualization Plan:
- Explain how you will use visualizations to test your hypotheses. Include the type of chart you plan to use for each question.
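The column descriptions requested above can be drafted from the data itself. The following is a minimal sketch, assuming the dataset loads into a pandas DataFrame; the helper name `summarize_columns` and the small animal table are hypothetical placeholders, not part of the class starter code.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame, max_examples: int = 3) -> pd.DataFrame:
    """Return one row per column: data type, missing count, unique count, sample values."""
    rows = []
    for name in df.columns:
        col = df[name]
        # Take the first few distinct non-missing values as examples
        examples = col.dropna().unique()[:max_examples]
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "missing": int(col.isna().sum()),
            "unique": int(col.nunique()),
            "examples": ", ".join(str(v) for v in examples),
        })
    return pd.DataFrame(rows)


# Hypothetical placeholder data; in the project, load your own file instead,
# for example with pd.read_csv(...)
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", None],
    "weight_kg": [4.2, 18.5, None, 22.0],
    "year": [2019, 2020, 2020, 2021],
})

summary = summarize_columns(df)
print(summary.to_string(index=False))
```

The printed table is only a starting point: the data types and example values still need a written explanation of what each column means.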
② Python script:
Your `main.py` script should be well-organized and demonstrate that you performed meaningful data analysis and generated diverse visualizations.
- Load & process the dataset
- Handle missing values.
- Rename columns if necessary for clarity.
- Convert data types if needed.
- Filter and/or group data for focused analysis.
- Contain code for at least 4 different types of visualizations
- Ensure each visualization is well-labeled with titles, axis labels, and legends.
- Visualizations should be clear, informative, and appropriate for the data (see guidelines below)
See the Example Graph Gallery for inspiration.
- Be thoroughly commented
- Explain the purpose of each section of the code.
- Use functions for reusable components of your code.
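The load-and-process requirements above might be organized into commented, reusable functions along these lines. This is a sketch under stated assumptions: the column names (`Dt`, `Cat`, `Amt`), the inline table, and both function names are hypothetical, and dropping or median-filling missing values is only one of several reasonable choices for a given dataset.

```python
import pandas as pd


def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy with clearer names, proper types, and no missing values."""
    # Rename cryptic columns for clarity (hypothetical names)
    df = raw.rename(columns={"Dt": "date", "Cat": "category", "Amt": "amount"})
    # Convert text to real dates and numbers; unparseable entries become NaT/NaN
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Handle missing values: drop rows without a date, fill amounts with the median
    df = df.dropna(subset=["date"]).copy()
    df["amount"] = df["amount"].fillna(df["amount"].median())
    return df


def mean_by_category(df: pd.DataFrame, min_year: int) -> pd.Series:
    """Filter to rows from min_year onward, then average amount per category."""
    recent = df[df["date"].dt.year >= min_year]
    return recent.groupby("category")["amount"].mean()


# Hypothetical placeholder data standing in for a real dataset
raw = pd.DataFrame({
    "Dt": ["2021-03-01", "2022-07-15", "not a date", "2023-01-20"],
    "Cat": ["A", "B", "A", "A"],
    "Amt": ["10.5", "20", "7", None],
})

cleaned = clean_data(raw)
result = mean_by_category(cleaned, min_year=2021)
print(result)
```

Keeping each step in its own function makes it easier to comment on its purpose and to reuse it for several research questions.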
③ Poster:
A scientific poster for an academic conference must effectively communicate key findings. All captions and text should be concise and relevant.
- Introduction
- Context for the dataset and why it is interesting/relevant.
- Clearly state your research questions or hypotheses.
- Methods
- Describe your process for cleaning and analyzing the data.
- Include screenshots of key Python code snippets.
- Results
- Present 2–3 visualizations from your analysis.
- Provide clear captions for each figure.
- Discussion
- Interpret your findings and discuss patterns or trends observed in the data.
- Mention any limitations of your analysis.
- Conclusions
- Summarize key takeaways.
- Suggest potential areas for future research or data exploration.
- References
- Cite the dataset and any external resources
Conduct research on the topic to enhance your poster's Introduction and/or Discussion sections!
Choosing Appropriate Visualizations
💡 Choosing the right type of visualization is crucial for effectively communicating your findings. Below are some guidelines to help you decide. Check out this resource from UC Berkeley for additional tips.
Example Chart Selection Table
| Question Type | Recommended Chart Type |
|---|---|
| Proportions within a whole | Pie chart, Stacked bar plot, Word cloud |
| Trends over time, sequential events | Line plot, Animated plots |
| Distribution of a variable | Histogram, Box plot (comparing between groups) |
| Comparison across categories | Bar plot, Grouped bar plot (catplot) |
| Relationships between variables | Scatter plot (two variables), Heatmap (multiple variables) |
