[파이썬] ggplot 상관 계수 및 회귀선 추가

In data analysis and visualization, it is often useful to examine the relationship between two variables. One common approach is to compute the correlation coefficient to measure the strength and direction of the relationship. Additionally, it can be helpful to visualize this relationship using a regression line.

In this blog post, we will explore how to add correlation coefficients and regression lines to a ggplot in Python using the ggplot library. We will walk through an example and provide sample code to help you get started.

Prerequisites

Before we start, make sure you have the following components installed and set up:

Example Dataset

Let’s consider a simple example where we have a dataset containing the heights and weights of individuals. We want to analyze the relationship between these two variables and visualize it using a scatter plot with a regression line.

import pandas as pd

data = pd.DataFrame({
    'Height': [160, 165, 170, 168, 172, 178, 175, 180, 185, 190],
    'Weight': [60, 65, 68, 70, 75, 80, 82, 85, 90, 95]
})

Computing Correlation Coefficients

To compute the correlation coefficient between two variables, you can use the corr() function in pandas:

correlation_coeff = data['Height'].corr(data['Weight'])
print(f"Correlation Coefficient: {correlation_coeff}")

In this example, correlation_coeff will contain the computed correlation coefficient.

Adding a Regression Line to a ggplot

To add a regression line to a ggplot, we need to use the geom_smooth() function. This function fits a regression line to the data and visualizes it in the plot.

Let’s create the scatter plot with the regression line:

from ggplot import *

p = ggplot(data, aes(x='Height', y='Weight')) + \
    geom_point() + \
    geom_smooth(method='lm')
  
print(p)

In this code, geom_point() is used to create the scatter plot, while geom_smooth(method='lm') adds the regression line using the method ‘lm’ (linear regression).

Adding Correlation Coefficient to the ggplot

To add the correlation coefficient to the ggplot, we can use the annotate() function. This function is used to add text or labels to the plot.

p = p + \
    annotate("text", x=165, y=80, label=f"Corr: {correlation_coeff:.2f}")
  
print(p)

In this example, we use annotate() to add the correlation coefficient to the plot at the specified position (165, 80), using the format Corr: {correlation_coeff:.2f}.

Conclusion

In this blog post, we explored how to add correlation coefficients and regression lines to ggplot in Python. We covered how to compute the correlation coefficient using pandas, and how to create a scatter plot with a regression line using ggplot. We also discussed how to add the correlation coefficient to the plot using the annotate() function.

Being able to visualize the relationship between variables and estimate the strength of their association is crucial in data analysis. By adding correlation coefficients and regression lines to your ggplot, you can gain deeper insights into your data.

I hope this article was helpful in understanding how to incorporate correlation coefficients and regression lines in ggplot. Happy plotting!