From the Vega Example Gallery
A histogram subdivides a numerical range into bins, and counts the number of data points within each segment. The resulting bar chart provides a discrete estimate of the probability density function.
This example demonstrates a histogram over a numerical range, with a segment to show the prevalence of null values.
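As a rough illustration of the binning step, the following Python sketch counts values per fixed-width bin and tallies null values separately. The function name, bin parameters, and sample values are hypothetical and are not taken from the Vega spec.

```python
import math

def histogram(values, start, stop, step):
    """Count values in fixed-width bins over [start, stop]; count None separately."""
    n_bins = math.ceil((stop - start) / step)
    counts = [0] * n_bins
    nulls = 0
    for v in values:
        if v is None:
            nulls += 1          # null values get their own segment
            continue
        if start <= v <= stop:  # values outside the range are ignored
            # A value equal to `stop` is placed in the last bin.
            i = min(int((v - start) // step), n_bins - 1)
            counts[i] += 1
    return counts, nulls

counts, nulls = histogram([1, 2, 2.5, None, 7, 10, None], start=0, stop=10, step=5)
```

Dividing each count by the total number of non-null values and by the bin width would turn the counts into a discrete density estimate.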
A visual comparison of two estimated probability distributions for a sample of numeric values: a normal (Gaussian) distribution parameterized by the mean and standard deviation, and a kernel density estimate. This example supports estimates of either probability density functions (pdf) or cumulative distribution functions (cdf), using Vega’s density transform.
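The two estimates being compared can be sketched in Python as follows. This is an illustrative stand-in, not the density transform itself: the Gaussian kernel, the bandwidth value, and the helper names are assumptions made for the example.

```python
from statistics import NormalDist, mean, stdev

def normal_fit(sample):
    """Normal distribution parameterized by the sample mean and standard deviation."""
    return NormalDist(mean(sample), stdev(sample))

def kde_pdf(sample, bandwidth):
    """Return a function that estimates the density at x using a Gaussian kernel."""
    kernel = NormalDist(0.0, 1.0)
    n = len(sample)

    def pdf(x):
        return sum(kernel.pdf((x - s) / bandwidth) for s in sample) / (n * bandwidth)

    return pdf

sample = [1.0, 2.0, 3.0]        # hypothetical data
fit = normal_fit(sample)         # parametric estimate; fit.pdf(x) and fit.cdf(x) are available
kde = kde_pdf(sample, 1.0)       # non-parametric estimate with an assumed bandwidth of 1.0
```

Evaluating both functions over a grid of x values gives the two curves to plot side by side.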
A box plot summarizes a distribution of quantitative values using a set of summary statistics. Here, the boxes show the interquartile range (IQR), with the white bar indicating the median value. The thin lines (“whiskers”) currently show the extent of the minimum and maximum values; other values, such as whiskers extending 1.5 * IQR from each end of the box, are often used as well. See the violin plot example for an alternative approach.
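The summary statistics behind a box plot can be sketched in Python as below. The quartile method ("inclusive" linear interpolation) and the function name are assumptions; quantile conventions differ between tools, so exact values may not match a given Vega chart.

```python
from statistics import quantiles

def box_stats(values, k=None):
    """Quartiles plus whisker ends.

    k=None  -> whiskers span the minimum and maximum values.
    k=1.5   -> whiskers reach the most extreme values within k * IQR of the box.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    if k is None:
        lo, hi = data[0], data[-1]
    else:
        iqr = q3 - q1
        lo = min(v for v in data if v >= q1 - k * iqr)
        hi = max(v for v in data if v <= q3 + k * iqr)
    return {"q1": q1, "median": median, "q3": q3, "whisker_lo": lo, "whisker_hi": hi}

full = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])          # min/max whiskers
tukey = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30], k=1.5)  # 1.5 * IQR whiskers
```

With the hypothetical data above, the min/max whisker extends to the outlier at 30, while the 1.5 * IQR whisker stops at 8.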
A violin plot visualizes a distribution of quantitative values as a continuous approximation of the probability density function, computed using kernel density estimation (KDE). The densities are additionally annotated with the median value and interquartile range, shown as black lines. Violin plots can be more informative than classical box plots.
A plot of the top-k film directors by aggregate worldwide gross. The spec performs an aggregation over all directors, ranks them, and filters to only the top results, using the window transform.
A plot of the top-k film directors, plus all other directors, by aggregate worldwide gross. Unlike the previous example, this chart includes a category of all other directors aggregated together. The visualization spec first computes aggregates for all directors and ranks them. It then copies these ranks back to the source data using a lookup transform, and determines which directors belong in the “other” category before performing a final aggregation.
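The aggregate, rank, and regroup steps can be sketched in plain Python. This mirrors the logic only, not the Vega transforms; the record layout, the "All Others" label, and the director names are hypothetical.

```python
from collections import defaultdict

def top_k_with_other(records, k):
    """records: iterable of (director, gross). Returns top-k totals plus one 'other' row."""
    totals = defaultdict(float)
    for director, gross in records:
        totals[director] += gross                      # aggregate per director
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # rank by total
    top = ranked[:k]
    rest = ranked[k:]
    if rest:
        # Everyone ranked below k is merged into a single category.
        top.append(("All Others", sum(g for _, g in rest)))
    return top

rows = top_k_with_other([("A", 10), ("B", 5), ("A", 7), ("C", 3), ("D", 1)], k=2)
```

Dropping the final append step yields the plain top-k variant.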
A binned scatterplot is a more scalable alternative to the standard scatter plot. The data points are grouped into bins, and an aggregate statistic is used to summarize each bin. Here we use a circular area encoding to depict the count of records, visualizing the density of data points. For higher bin counts color might instead be used, though with some loss of perceptual comparison accuracy.
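A minimal Python sketch of the two-dimensional binning, assuming fixed bin widths along each axis (the function name and sample points are hypothetical):

```python
import math
from collections import Counter

def bin2d(points, step_x, step_y):
    """Count points per rectangular bin, keyed by the bin's lower-left corner."""
    counts = Counter()
    for x, y in points:
        key = (math.floor(x / step_x) * step_x, math.floor(y / step_y) * step_y)
        counts[key] += 1
    return counts

counts = bin2d([(0.5, 0.5), (0.9, 0.1), (1.5, 0.5), (2.2, 3.7)], 1, 1)
# For an area encoding, a circle's radius would scale with sqrt(count).
```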
A contour plot depicts the density of data points using a set of discrete levels. Akin to contour lines on topographic maps, each contour boundary is an isoline of constant density. Kernel density estimation is performed to generate a continuous approximation of the sample density.
A wheat plot is an alternative to standard dot plots and histograms that incorporates aspects of both. The x-coordinate of a point is based on its exact value. The y-coordinate is determined by grouping points into histogram bins, then stacking them based on their rank order within each bin. While not scalable to large numbers of data points, wheat plots allow inspection of (and interaction with) individual points without overplotting.
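The layout rule described above can be sketched in Python as follows, assuming fixed-width bins; the function name and sample values are hypothetical.

```python
import math
from collections import defaultdict

def wheat_layout(values, step):
    """Return (x, y) pairs: x is the exact value, y is the rank within its bin."""
    bins = defaultdict(list)
    for v in values:
        bins[math.floor(v / step)].append(v)   # group points into histogram bins
    layout = []
    for b in sorted(bins):
        # Stack points within a bin by rank order, starting at 1.
        for rank, v in enumerate(sorted(bins[b]), start=1):
            layout.append((v, rank))
    return layout

layout = wheat_layout([0.2, 0.7, 0.4, 1.1, 1.9, 2.5], step=1)
```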
A quantile dot plot represents a probability distribution by taking a uniform sample of quantile values and plotting them in a dot plot. It visualizes a representative set of possible outcomes of a random process, and provides a discrete alternative to probability density and violin plots in which finding probability intervals reduces to counting dots in the display.
This example visualizes quantiles for a log-normal distribution that models hypothetical bus arrival times (in minutes from the current time), following the example of Kay, Kola, Hullman and Munson, 2016. If we are willing to miss a bus 2 out of 20 times, given 20 quantiles we can count up 2 dots from the left to get the time we should arrive at the bus stop.
Click or drag on the chart to explore risk thresholds for arriving at the bus stop. Double-click to remove the threshold.
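The quantile sampling and the dot-counting rule can be sketched in Python. The log-normal parameters below are made up for illustration and are not the values used in the example or in the cited paper; the mid-point probabilities (i + 0.5) / n are likewise one common convention, assumed here.

```python
import math
from statistics import NormalDist

def lognormal_quantiles(mu, sigma, n):
    """n quantiles of a log-normal distribution at probabilities (i + 0.5) / n."""
    z = NormalDist()  # standard normal
    return [math.exp(mu + sigma * z.inv_cdf((i + 0.5) / n)) for i in range(n)]

def arrival_time(quantiles, misses):
    """Value of the `misses`-th dot counted from the left."""
    return sorted(quantiles)[misses - 1]

dots = lognormal_quantiles(mu=2.0, sigma=0.2, n=20)  # hypothetical parameters
threshold = arrival_time(dots, misses=2)             # accept missing 2 of 20 buses
```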
A quantile-quantile (or Q-Q) plot visually compares two probability distributions by plotting a set of matching quantile values for both, for example the corresponding 1st, 2nd, 3rd, etc., percentiles of each distribution. Q-Q plots are often used to plot an empirical data distribution against a theoretical distribution. If the two distributions are similar, the plotted points will lie along a line; notable deviations from a line are evidence of different distribution functions.
This example compares an empirical sample against two theoretical distributions. Change the input data source (samples from normal or uniform distributions) to observe how different samples compare with the theoretical distributions. The quantile transform produces quantile values for input data; the quantileUniform and quantileNormal expression functions produce the theoretical quantile values.
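A Python sketch of how matching quantile pairs might be computed. It is an analogue of the idea, not of Vega's quantile transform or expression functions: the interpolation rule, the probability grid, and the use of a standard uniform and standard normal as reference distributions are all assumptions.

```python
import math
from statistics import NormalDist

def empirical_quantile(sorted_data, p):
    """Linearly interpolated quantile of already-sorted data, 0 <= p <= 1."""
    h = (len(sorted_data) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (h - lo) * (sorted_data[hi] - sorted_data[lo])

def qq_points(sample, n):
    """Return (empirical, uniform, normal) quantile triples at n probabilities."""
    data = sorted(sample)
    std_normal = NormalDist()
    points = []
    for i in range(n):
        p = (i + 0.5) / n
        # The quantile function of a uniform distribution on [0, 1] is p itself.
        points.append((empirical_quantile(data, p), p, std_normal.inv_cdf(p)))
    return points

points = qq_points([0.0, 0.25, 0.5, 0.75, 1.0], n=2)
```

Plotting the first element of each triple against the second (or third) gives one Q-Q panel per theoretical distribution.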
Rather than showing a continuous probability distribution, Hypothetical Outcome Plots (or HOPs) visualize a set of draws from a distribution, where each draw is shown as a new plot in either a small multiples or animated form.
This example – inspired by The New York Times – displays random draws for a simulated time-series of values (these could be sales or employment statistics). The noise signal determines the amount of random variation added to the signal. The trend signal determines the strength of a linear trend, where zero corresponds to no trend at all (a flat uniform distribution). When the noise is high enough, draws from a distribution without any underlying trend may cause us to “hallucinate” interesting variations. Viewing the different frames may help viewers get a more intuitive sense of random variation.
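A small Python sketch of generating such draws, one list of values per frame. The Gaussian noise model, the parameter names, and the fixed seed are assumptions for illustration; the example's actual noise distribution is not specified here.

```python
import random

def hop_frames(n_frames, n_points, trend, noise, seed=0):
    """Each frame is one random draw of a series: linear trend plus Gaussian noise."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        [trend * t + rng.gauss(0.0, noise) for t in range(n_points)]
        for _ in range(n_frames)
    ]

flat_but_noisy = hop_frames(n_frames=5, n_points=12, trend=0.0, noise=1.0)
```

With trend set to zero, any apparent pattern in a single frame comes from noise alone, which is the effect the animation is meant to expose.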
A two-dimensional regression analysis models one data variable as a function of another. The resulting model produces a trend line that summarizes and extrapolates observed data. This example uses parametric regression models to predict IMDB users’ film ratings based on Rotten Tomatoes critics’ ratings.
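As one instance of a parametric model, an ordinary least-squares line can be fitted as in the Python sketch below. The data values are invented; they are not film ratings from the example's dataset.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
predict = lambda x: slope * x + intercept  # the trend line
```

Other parametric forms (logarithmic, exponential, polynomial) follow the same pattern with a transformed or extended design.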
Locally-estimated regression produces a trend line by performing weighted regressions over a sliding window of points. The bandwidth parameter determines the size of the sliding window of nearest-neighbor points, expressed as a fraction of the total number of points included.
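The role of the bandwidth parameter can be illustrated with a simplified Python sketch of a single local fit. It assumes tricube weights and a local linear model, omits robustness iterations, and slightly inflates the window radius so the farthest neighbor keeps a small positive weight; none of these details are claimed to match Vega's implementation.

```python
import math

def loess_point(xs, ys, x0, bandwidth):
    """Estimate y at x0 from a weighted linear fit over the nearest neighbors."""
    n = len(xs)
    k = max(2, math.ceil(bandwidth * n))  # window size as a fraction of all points
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    d_max = (max(abs(xs[i] - x0) for i in nearest) or 1.0) * 1.0001
    # Tricube weights: close points count more, distant points less.
    w = [(1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the weighted mean
    return my + slope * (x0 - mx)

xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
estimate = loess_point(xs, ys, 4.5, bandwidth=0.5)  # uses the 5 nearest points
```

Evaluating loess_point over a grid of x0 values traces out the trend line; a larger bandwidth gives a smoother curve.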