# v8.0 Performance FAQ

## Give me a summary of the perf score changes in v8.0. What's new/different?

First, it may be useful to refresh on the math behind Lighthouse's metric scores and performance score.

In Lighthouse v8.0, we updated the score curves for FCP and TBT measurements, making both a bit more strict. CLS has been updated to its new, windowed definition. Additionally, the Performance Score's weighted average was rebalanced, giving more weight to CLS and TBT than before, and slightly decreasing the weights of FCP, SI, and TTI.
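
To make the curve changes concrete: each metric score comes from a log-normal curve defined by two control points, the value that earns a score of 0.9 (p10) and the value that earns 0.5 (median). The sketch below shows the shape of that mapping; the TBT control points used here are illustrative, not the exact v8 constants.

```python
import math

def log_normal_score(value: float, p10: float, median: float) -> float:
    """Map a "lower is better" metric value onto a 0-1 score using a log-normal
    curve anchored so that value == p10 scores 0.9 and value == median scores 0.5."""
    z_for_p10 = 1.28155  # standard-normal quantile for 0.9
    sigma = (math.log(median) - math.log(p10)) / z_for_p10
    z = (math.log(value) - math.log(median)) / sigma
    # Complementary CDF of the fitted log-normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative TBT-style control points (milliseconds).
for tbt_ms in (200, 600, 1200):
    print(tbt_ms, round(log_normal_score(tbt_ms, p10=200, median=600), 2))
```

Making a curve "more strict" amounts to lowering those control points, so the same raw metric value earns a lower score.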

From an analysis of HTTP Archive's latest crawl of the web, we project that the performance score for the majority of sites will stay the same or improve in Lighthouse 8.0.

- ~20% of sites may see a drop of up to 5 points, though likely less
- ~20% of sites will see little detectable change
- ~30% of sites should see a moderate improvement of a few points
- ~30% of sites could see a significant improvement of 5 points or more

The biggest drops in scores are due to TBT scoring becoming stricter and the metric's slightly higher weight. The biggest improvements in scores are also due to TBT changes in the long tail and the windowing of CLS, and both metrics' higher weights.

## What are the exact score weighting changes?

### Changes by metric

| metric | v6 weight (%) | v8 weight (%) | Δ |
| --- | --- | --- | --- |
| First Contentful Paint (FCP) | 15 | 10 | -5 |
| Speed Index (SI) | 15 | 10 | -5 |
| Largest Contentful Paint (LCP) | 25 | 25 | 0 |
| Time To Interactive (TTI) | 15 | 10 | -5 |
| Total Blocking Time (TBT) | 25 | 30 | +5 |
| Cumulative Layout Shift (CLS) | 5 | 15 | +10 |

### Changes by phase

| phase | metric | v6 phase weight (%) | v8 phase weight (%) | Δ |
| --- | --- | --- | --- | --- |
| early | First Contentful Paint (FCP) | 15 | 10 | -5 |
| mid | Speed Index (SI), Largest Contentful Paint (LCP) | 40 | 35 | -5 |
| interactivity | Time To Interactive (TTI), Total Blocking Time (TBT) | 40 | 40 | 0 |
| predictability | Cumulative Layout Shift (CLS) | 5 | 15 | +10 |
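
To see how these weights combine: the overall performance score is the weighted average of the individual 0-1 metric scores, shown on a 0-100 scale. A minimal sketch using the v8 weights from the "changes by metric" table and made-up per-metric scores:

```python
# v8 metric weights (they sum to 1.0), per the "changes by metric" table above.
V8_WEIGHTS = {'FCP': 0.10, 'SI': 0.10, 'LCP': 0.25, 'TTI': 0.10, 'TBT': 0.30, 'CLS': 0.15}

# Hypothetical per-metric scores, each already mapped onto 0-1 by its score curve.
metric_scores = {'FCP': 0.90, 'SI': 0.90, 'LCP': 0.60, 'TTI': 0.80, 'TBT': 0.40, 'CLS': 1.00}

perf_score = sum(metric_scores[m] * weight for m, weight in V8_WEIGHTS.items())
print(round(perf_score * 100))  # ~68 for these illustrative inputs
```

Swapping in the v6 weights for the same per-metric scores shows how the rebalancing toward TBT and CLS moves the final number.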

## Why did the weight of CLS go up?

When CLS was introduced in Lighthouse v6, it was still early days for the metric. There have been many improvements and bugfixes to CLS since then. Now, given its maturity and established place in Core Web Vitals, its weight increases from 5% to 15%.

## Why are the Core Web Vitals metrics weighted differently in the performance score?

The Core Web Vitals metrics are independent signals in the Page Experience ranking update. Lighthouse weights each lab-equivalent metric based on what we believe creates the best incentives to improve overall page experience for users.

LCP, CLS, and TBT (the lab proxy for FID) capture the aspects of user experience we consider most important, which is why they are the three highest-weighted metrics in the performance score.

## How should I think about the Lighthouse performance score in relation to Core Web Vitals?

Core Web Vitals refer to a specific set of key user-experience metrics, their passing thresholds, and the percentile at which they're measured. In general, CWV's primary focus is field data.

The Lighthouse score is a means to understand the degree of opportunity available to improve critical elements of user experience. The lower the score, the more likely the user will struggle with load performance, responsiveness, or content stability.

Lighthouse's lab-based data overlaps with Core Web Vitals in a few key ways. Lighthouse features two of the three Core Web Vitals (LCP and CLS) with the exact same passing thresholds. There's no user input in a Lighthouse run, so it cannot compute FID. Instead, we have TBT, which you can consider a proxy metric for FID; though they measure different things, both are signals about a page's interactivity.

## So CWV and Lighthouse have commonalities but are different. How should you reconcile paying attention to both?

Ultimately, a combination approach is most effective. Use field data for the long-term overview of your users' experience, and use lab data to iterate your way to the best experience possible. CrUX data summarizes the most recent 28 days, so it will take some time to confirm that a change has had an impact.

Lighthouse's analysis allows you to debug and optimize in an environment that is repeatable with an immediate feedback loop. In addition, lab-based tooling can provide significantly more detail than field instrumentation, as it's not limited to web-exposed APIs and cross-origin restrictions.

The exact numbers of your lab and field metrics aren't expected to match, but any substantial improvement to your lab metrics should be observable in the field once it's been deployed. The higher the Lighthouse score, the less you're leaving up to chance in the field.

## What blindspots from the field can lab tooling illuminate?

Field data covers only the page loads that completed and reported successfully. Because failed and aborted loads are excluded, and reporting can be blocked by extensions, the collected field data can suffer from survivorship bias. Users who have better experiences use your site more; that's why we care about performance in the first place! Lab tooling shows you the quality of experience for the users that field data might be missing entirely.

Lighthouse mobile reports emulate a slow 4G connection on a mid-tier Android device. While field data might not indicate these conditions are especially common for your site, analyzing how your site performs in these tougher conditions helps expand your site's audience. Lighthouse identifies the worst experiences, experiences you can't see in the field because they were so bad the user never came back (or waited around in the first place).
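
For reference, the default mobile run's simulated throttling corresponds roughly to the conditions below. These figures are approximate ballpark values for illustration; the authoritative numbers live in Lighthouse's throttling configuration.

```python
# Rough shape of Lighthouse's default "slow 4G" mobile simulation (approximate values).
SIMULATED_MOBILE_THROTTLING = {
    'rtt_ms': 150,                 # simulated round-trip time
    'downlink_kbps': 1638,         # ~1.6 Mbps of throughput
    'cpu_slowdown_multiplier': 4,  # emulates a mid-tier Android-class CPU
}
```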

As always, using both lab and field data to understand and optimize your user experience is best practice. Read more about field & lab.

## How should I optimize CLS differently, given that the metric has been updated?

The windowing adjustment will likely not have much effect on the lab measurement, but it will have a large effect on field CLS for long-lived pages.
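
For the curious, here is a minimal sketch of the windowed ("maximum session window") definition: shifts are grouped into windows that close after a 1-second gap since the last shift or once the window spans 5 seconds, and the reported CLS is the largest window's sum. The timestamps and shift scores below are invented.

```python
def windowed_cls(shifts, gap_s=1.0, max_window_s=5.0):
    """Largest "session window" of layout shifts.

    `shifts` is a time-ordered list of (timestamp_seconds, shift_score) pairs.
    A new window starts once at least `gap_s` has passed since the previous
    shift, or once the current window would span more than `max_window_s`.
    """
    best = current = 0.0
    window_start = prev_time = None
    for t, score in shifts:
        starts_new_window = (
            prev_time is None
            or t - prev_time >= gap_s
            or t - window_start > max_window_s
        )
        if starts_new_window:
            current = 0.0
            window_start = t
        current += score
        best = max(best, current)
        prev_time = t
    return best

# Two bursts of shifts separated by a quiet stretch: only the larger burst counts.
print(round(windowed_cls([(0.2, 0.05), (0.6, 0.05), (8.0, 0.10), (8.4, 0.12)]), 2))  # 0.22
```

Under the old, unwindowed definition these four shifts would have summed to 0.32, which is why long-lived pages in the field see the biggest difference.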

Lighthouse 8 introduces another adjustment to our CLS definition: layout shift contributions from subframes are now included. This brings our implementation in line with how CrUX computes field CLS, and it means that iframes (including ones you may not control) may add layout shifts that ultimately affect your CLS score. Keep in mind that subframe contributions are weighted by the in-viewport portion of the iframe.

## Why don't the numbers for TBT and FID match, if TBT is a proxy metric for FID?

The commonality between TBT (collected in the lab) and FID (collected in the field) is that both measure the impact of long main-thread tasks on input responsiveness. Beyond that, they're quite different. FID captures the delay in handling the first input event on the page, whenever that input happened. TBT sums how far every long main-thread task runs past a 50 ms budget, regardless of whether any input occurred.
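
Here is a minimal sketch of what TBT adds up; Lighthouse measures this between FCP and TTI, and the task durations below are invented:

```python
BLOCKING_THRESHOLD_MS = 50  # main-thread tasks longer than this are "long tasks"

def total_blocking_time(task_durations_ms):
    """Sum the portion of each long task that exceeds the 50 ms threshold."""
    return sum(
        duration - BLOCKING_THRESHOLD_MS
        for duration in task_durations_ms
        if duration > BLOCKING_THRESHOLD_MS
    )

# Invented main-thread task durations (ms) observed between FCP and TTI.
print(total_blocking_time([30, 120, 90, 400]))  # (120-50) + (90-50) + (400-50) = 460
```

FID, by contrast, only reflects those tasks if a real user's first input happens to land while one is running, which is why the two numbers needn't agree.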

It's very possible to have a page that does well on FID, but poorly on TBT. And it's slightly harder, but possible, to do well on TBT but poorly on FID*. So, you shouldn't expect your TBT and FID measurements to correlate strongly. A large-scale analysis found their Spearman's ρ at about 0.40, which indicates a connection, but not one as strong as many would prefer.

From the Lighthouse project's perspective, the current passing threshold for FID is quite lenient but more importantly, the percentile-of-record for FID (75th percentile) is not sufficient for detecting issues. The 95th percentile is a much stronger indicator of problematic interactions for this metric. We encourage user-centric teams to focus on the 95th percentile of all input delays (not just the first) in their field data in order to identify and address problems that surface just 5% of the time.
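
As a small sketch of that recommendation, here is a nearest-rank 95th percentile over field-collected input delays; the sample values are invented:

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of field-collected input delays (in ms)."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # 1-indexed nearest rank
    return ordered[rank - 1]

# Invented input delays (ms): p95 surfaces the slow tail that a 75th-percentile view hides.
delays_ms = [5, 8, 10, 12, 15, 18, 20, 25, 30, 35, 40, 45, 60, 80, 90, 110, 150, 210, 300, 650]
print(p95(delays_ms))  # 300
```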

*Aside: the Chrome 91 FID change for double-tap-to-zoom fixes a lot of high-FID / low-TBT cases and may be observable in your field metrics, with higher percentiles improving slightly. Most remaining high-FID / low-TBT cases are likely due to incorrect meta viewport tags, which Lighthouse will flag. Delivering a mobile-friendly viewport, reducing main-thread-blocking JS, and keeping your TBT low are the best defense against bad FID in the field.

## Overall, what motivated the changes to the performance score?

As with all Lighthouse score updates, changes are made to reflect the latest in how to measure user-experience quality holistically and accurately, and to focus attention on key priorities.

Heavy JavaScript and long tasks are a worsening problem for the web. Field FID's current thresholds are too lenient and don't sufficiently incentivize action to address the problem. Lighthouse has historically weighted its interactivity metrics at 40-55% of the performance score, and, as interactivity is key to user experience, we maintain a 40% weighting (TBT and TTI together) in Lighthouse 8.0.

FCP's score curve was adjusted to align with the current de facto "good" threshold, and as a result will score a bit more strictly.

The curve for TBT was made stricter to more closely approach the ideal score curve. TBT has had (and still has) a more lenient curve than our methodology dictates, but the new curve is more linear, which means there's a larger range where improvements in the metric are rewarded with improvements in the score. If your page currently scores poorly on TBT, the new curve will be more responsive to changes as page performance incrementally improves.

FCP's weight drops slightly from 15% to 10% because it's fairly gameable and is also partly captured by Speed Index.

## What's the story with TTI?

TTI serves a useful role as the largest metric value reported (often >10 seconds), which helps anchor perceptions of how long the page takes to become fully interactive.

We see TBT as a stronger metric for evaluating the health of your main thread and its impact on interactivity, plus it has lower variability. TTI serves as a nice complement that captures the cost of long tasks, often from heavy JavaScript. That said, we expect to continue to reduce the weight of TTI and will likely remove it in a future major Lighthouse release.

## How does the Lighthouse Perf score get calculated? What is it based on?

The Lighthouse perf score is calculated from a weighted, blended set of performance metrics. You can see the current and previous Lighthouse score compositions (which metrics we are blending together, and at what weights) in the score calculator, and learn more about the calculation specifics here.

## What is the most exciting update in LH v8?

We're really excited about the interactive treemap, filtering audits by metric, and the new Content Security Policy audit, which was a collaboration with the Google Web Security team.