
Answer by eipi10 for Explain ggplot2 warning: "Removed k rows containing missing values"

The behavior you're seeing is due to how ggplot2 handles data that fall outside the axis ranges of the plot. Setting limits with scale_y_continuous (or, equivalently, ylim) excludes values outside those limits when calculating statistics, summaries, or regression lines. coord_cartesian includes all values in these calculations, regardless of whether they are visible in the plot area. Here are some examples:

library(ggplot2)

# Set one point to a large hp value
d = mtcars
d$hp[d$hp==max(d$hp)] = 1000

All points are visible in this plot:

ggplot(d, aes(mpg, hp)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(title="All points are visible; no warnings")
#> `geom_smooth()` using formula 'y ~ x'


In the plot below, one point with hp = 1000 is outside the y-axis range of the plot. Because we used scale_y_continuous to set the y-axis range, this point is dropped before ggplot calculates any statistics or summary measures, such as the linear regression line fitted by geom_smooth. ggplot also warns about the excluded point, once for each layer that lost it.

ggplot(d, aes(mpg, hp)) +
  geom_point() +
  scale_y_continuous(limits=c(0,300)) +  # Change this to limits=c(0,1000) and the warning disappears
  geom_smooth(method="lm") +
  labs(title="scale_y_continuous: excluded point is not used for regression line")
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 1 rows containing non-finite values (stat_smooth).
#> Warning: Removed 1 rows containing missing values (geom_point).
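The k in the warning can be reproduced by counting the rows that fall outside the limits. The following is a minimal base-R sketch, not part of ggplot2 itself; the names y_limits, out_of_range and k are made up for illustration, and the limits are simply copied from the scale_y_continuous call above.

```r
# Same modified data as above: the largest hp value is set to 1000
d <- mtcars
d$hp[d$hp == max(d$hp)] <- 1000

# Hypothetical helper variable: the limits passed to scale_y_continuous()
y_limits <- c(0, 300)

# Rows whose hp lies outside the limits; these are the rows ggplot drops
out_of_range <- d$hp < y_limits[1] | d$hp > y_limits[2]

# Number of dropped rows, i.e. the k reported in the warning
k <- sum(out_of_range)
print(k)

# Inspect the affected rows
print(d[out_of_range, c("mpg", "hp")])
```

Rows that already contain NA in a plotted column would presumably be counted in the same warning, so it is worth checking for those as well with sum(is.na(d$hp)).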


In the plot below, the point with hp = 1000 is still outside the y-axis range of the plot. However, because we used coord_cartesian, this point is nevertheless included in any statistics or summary measures that ggplot calculates, such as the linear regression line.

If you compare this plot with the previous one, you can see that the linear regression line in this plot has a much steeper slope and wider confidence bands, because the point with hp=1000 is included when calculating the regression line, even though it's not visible in the plot.

ggplot(d, aes(mpg, hp)) +
  geom_point() +
  coord_cartesian(ylim=c(0,300)) +
  geom_smooth(method="lm") +
  labs(title="coord_cartesian: excluded point is still used for regression line")
#> `geom_smooth()` using formula 'y ~ x'
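The difference between the two regression lines can also be checked numerically. This is a rough sketch that mimics the two cases with plain lm calls, assuming geom_smooth(method="lm") fits hp ~ mpg on whatever rows survive the scale. The names fit_scale, fit_coord and slopes are illustrative only.

```r
# Same modified data as above: the largest hp value is set to 1000
d <- mtcars
d$hp[d$hp == max(d$hp)] <- 1000

# Approximates the scale_y_continuous(limits=c(0,300)) case:
# the out-of-range row is removed before the model is fitted
fit_scale <- lm(hp ~ mpg, data = subset(d, hp >= 0 & hp <= 300))

# Approximates the coord_cartesian(ylim=c(0,300)) case:
# every row is used; only the visible region of the plot changes
fit_coord <- lm(hp ~ mpg, data = d)

# Slopes side by side
slopes <- c(scale_limits = unname(coef(fit_scale)["mpg"]),
            coord_zoom   = unname(coef(fit_coord)["mpg"]))
print(slopes)
```

Both slopes are negative, and the coord_zoom slope should be the larger in magnitude, which matches the steeper line in the coord_cartesian plot.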
