
40-GEP for SCC: Fishing for Prognostic Answers in a Sea of Fishy Statistics

Updated: Dec 10, 2022

Gene expression profiling is a promising assay that will likely provide enhanced predictive modeling for tumors of the skin and other tissues. In the setting of melanoma, it has promise to provide improved prognostic information, potentially obviate the need for the somewhat controversial sentinel lymph node biopsy, and guide treatment.


One of the challenges with cutaneous squamous cell carcinoma (cSCC) is that current staging systems are relatively inaccurate in predicting metastasis: 30% of metastatic outcomes come from lower-risk tumors (BWH T1/T2a), and 25% of higher-risk tumors (BWH T2b/T3) have favorable outcomes (as stated in 34821148). Gene expression profiling will hopefully improve on that. The original article by Wysong et al. in JAAD (32344066) laid the foundation for this work, but not without raising a number of questions.


To understand the challenges with interpretation and adoption of the results of this study, one must first understand the background epidemiology of cSCC and its metastatic rates. The question to ask in this context is: does the study population from which a statistic was derived represent the overall population of patients with cSCC at large? Or at the very least, does it represent a population that is generalizable enough, or similar enough to the majority of practice settings, to be clinically useful? Or, at the very very least, can we pinpoint precisely the population of patients to which the statistic best applies?


In 2012, one of the largest studies published on this topic came out of New Zealand (22592943), including nearly 9000 cSCCs among roughly 6000 patients. Follow-up was a mean of 71 months (median 70 months), with a range of 31–121 months and a standard deviation of 25 months; this is important because there is data suggesting that the majority of metastases occur within 36 months. (23677079) Risk factors of note that were captured include: 8% poorly differentiated, 1% PNI, and 1% LVI. In this cohort, 43% of SCCs were on the trunk and extremities and 54% on the head and neck (3% other locations). The authors calculated an overall metastatic rate of 1.9-2.6%. A range was provided because a large number of these metastases, 65% in fact, did not have a clearly identifiable primary lesion. The lower bound treats these 65% of metastases as mucosal in origin, and thus not the result of one of the 9000 included cSCCs, while the upper bound classifies them all as cutaneous in origin.


In 2013, a study out of Brigham and Women's Hospital (BWH) looked at 1832 cSCCs in 985 patients with a median follow-up of 50 months (range 2-142), and found a 3.7% nodal metastasis rate, 94% of which occurred by 3 years and 100% by 4 years; 62% of tumors were on the trunk and extremities and 29% on the head and neck (23677079), likely statistically different from the New Zealand study. Compared to the New Zealand cohort, notable risk factors included 12.5% poorly differentiated tumors and 4.3% PNI (of which about 1/3 was >0.1mm), quite a bit higher, and possibly the reason the reported metastatic rate was about 50% higher as well (strangely, descriptive statistics on LVI were left out, despite presenting univariate analysis that included this factor). Neither of these studies broke the gross metastatic rate down by tumor location, a helpful tool that could be used to compare metastatic rates across studies with varying proportions of tumors on the head and neck vs other locations. But they did provide statistical analyses suggesting that at least some locations on the head and neck were higher risk (along with anogenital locations), although there were discrepancies between the New Zealand and BWH cohorts.


The metastatic rates in the New Zealand and BWH cohorts are likely statistically different at 2.6% and 3.7%, respectively, but clinically similar when you consider that the former translates to a metastasis in 1 of every 38 patients and the latter 1 in every 27. But compare this to the Wysong paper, in which the metastatic rate was 16.2%! That's 1 in every 6 patients! Significantly higher than population estimates, as was the rate of PNI (11.2%, with 2.2% >0.1mm), despite similar rates of poorly differentiated tumors (11.2%). We will return to the implications of this later, but we know it will at least have an impact on the positive predictive value of the test, which increases as the prevalence of an outcome increases.


If the goal of the 40-GEP is to improve on the prognostic value of clinical staging criteria like AJCC8 and BWH, maybe we should take a closer look at at least part of their construction to see whether the cohorts used to create the clinical staging (prognostic) systems are similar to those used to create the 40-GEP.


In the original Brigham and Women's staging system paper, Jambusaria-Pahlajani et al (23325457), the authors selected cSCCs with at least 1 of the following risk factors: PNI/LVI, poor differentiation, depth beyond fat, diameter of 2cm or more, ear location, or vermilion lip location. This led to a high percentage of tumors with these risk factors: 24% PNI, 20% poorly differentiated, 18% 2cm or greater, 15% depth beyond fat. Despite selecting a cohort rich with high-risk factors, the rate of metastasis in BWH T1 tumors was only 0.7%, compared to 10.7% in the Wysong paper; overall it was 9.8%, compared to 16.2% in the Wysong paper. Only 16% of metastases came from BWH T1/T2a tumors, and 18% of BWH T2b/T3 tumors had favorable outcomes (no SCC-related events; unclear if this included local recurrence); this is compared to 30% and 25%, respectively, approximated by the literature review in the introduction of Ibrahim et al (34821148).


Speaking of which, said Ibrahim et al paper (34821148) attempted to integrate the 40-GEP with clinical factors, similar to what has recently been done with the i31-GEP for melanoma. Metastatic rates, similar to other 40-GEP-oriented papers, were high: 15% overall and 9.8% in BWH T1 tumors. Rates of PNI (12.6%) were also well above background population estimates (1%-4.2%). Of note, Mohs surgery was the definitive surgical modality in a distinctly higher proportion of tumors without metastasis (81.5%) than with metastasis (66.7%); this was also true in an almost identical fashion in the Wysong paper (the BWH paper, 23677079, claimed to be underpowered to assess this).


So how do we make sense of all these statistics? It's confusing, and without the raw data and standardization of research protocols it is difficult to do a rigorous statistical analysis. But, we can do some napkin math to at least shed some light on what is going on. For starters, compare the characteristics of the 3 cohorts in the table below:

| Study | Overall Metastatic Rate | BWH T1 Metastatic Rate | BWH T2a Metastatic Rate |
| --- | --- | --- | --- |
| Jambusaria-Pahlajani et al (PMID: 23325457) | 9.8% | 0.7% | 4.5% |
| Wysong et al (PMID: 32344066) | 16.2% | 10.8% | 19.4% |
| Ibrahim et al (PMID: 34821148) | 15.0% | 9.5% | 15.2% |

The first thing that is evident is that the overall metastatic rate in all of these papers is far higher than population estimates, which in the two population studies reviewed above was <5%. This isn't entirely surprising, because in the process of creating a prognostic system you enrich the population with cases that have positive events (recurrence, metastasis, death, etc.) in order to have a high enough event rate to make the calculation feasible. However, subsequent validation should be done on a cohort of randomly selected patients that more closely resembles the population to which you intend to apply the prognostic model or staging system. This doesn't appear to have been done in any of these papers.


The second thing that is evident is the dramatically higher metastatic rate in low-risk tumors (BWH T1/T2a). The BWH paper more closely represents true population rates in these low-risk tumors, whereas the Wysong and Ibrahim papers reflect a population with a roughly 1 in 10 rate of metastasis for T1 tumors and a roughly 1 in 5 to 1 in 7 rate for T2a tumors! How this happens, I think, is self-evident. In the process of creating a GEP model, you want a substantial number of tumors of all stages with metastatic or other events of significance you are attempting to predict, so you cherry-pick some T1 or T2a tumors that metastasized in order to enrich your cohort with metastatic events. These may all share a genetic profile that is related to poorer outcomes. However, during this cherry-picking process you leave behind numerous other T1 or T2a tumors with the same genetic profile that didn't metastasize.


What is the impact of this? We now turn to some statistics that are often used to assess how well a test performs: sensitivity, specificity, positive predictive value, negative predictive value, and, often ignored, accuracy. Sensitivity and specificity are constants derived from the prognostic model that you created. These are the measures that would be influenced by cherry-picking certain cases during modeling. They can also be manipulated by adjusting your thresholds for positivity. For example, say expression of 5 genes predicted metastasis 100% of the time. If you set this as your criterion, 5 out of 5 genes for a positive test, your specificity would be 100%. However, some tumors with 4 out of 5, 3 out of 5, 2 out of 5, 1 out of 5, and maybe even 0 of 5 genes expressed will also metastasize, and the sensitivity of the test will reflect this: you will miss all those tumors with fewer than 5 genes expressed that metastasized.

For some tests you may sacrifice specificity for the sake of sensitivity. This is the case for diseases that have highly sensitive and specific confirmatory testing that may be more expensive or carry a higher risk of complications. In this case you run the cheaper, less invasive screening test (ie a GEP test) and then confirm with a more expensive and invasive gold standard. Cologuard followed by traditional colonoscopy is a good example of this. However, this balance of sensitivity and specificity is more crucial for cSCC. For colon cancer, colonoscopies are more costly and invasive, but still generally safe with minimal morbidity; they are also highly sensitive and specific. Follow-up testing and interventions for cSCC, however, are less sensitive and specific (like sentinel lymph node biopsy, complete lymph node dissections, adjuvant radiation therapy, etc.). The perfect balance between sensitivity and specificity is, therefore, much tougher to strike, and accuracy, the oft-forgotten statistic, is very important to evaluate.
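The 5-gene thought experiment above can be made concrete with a toy example. The gene counts and outcomes below are invented purely for illustration and have nothing to do with the actual 40-GEP gene set:

```python
# Toy data: each tumor is (number of the 5 hypothetical genes expressed,
# whether it metastasized). Entirely made up for illustration.
tumors = [
    (5, True), (5, True),              # 5/5 genes: all metastasized
    (4, True), (4, False),
    (3, True), (3, False), (3, False),
    (1, True), (1, False),
    (0, False), (0, False),
]

def sens_spec(threshold):
    """Call the test positive when >= threshold genes are expressed."""
    tp = sum(1 for g, met in tumors if g >= threshold and met)
    fn = sum(1 for g, met in tumors if g < threshold and met)
    tn = sum(1 for g, met in tumors if g < threshold and not met)
    fp = sum(1 for g, met in tumors if g >= threshold and not met)
    return tp / (tp + fn), tn / (tn + fp)

# Requiring all 5 genes gives perfect specificity but poor sensitivity;
# lowering the threshold trades specificity away for sensitivity.
for t in (5, 3, 1):
    sens, spec = sens_spec(t)
    print(f"threshold {t}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these made-up tumors, the 5-of-5 criterion misses every metastasis that occurred with fewer genes expressed, exactly as described above.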


Accuracy is, plain and simple, the proportion of tests that are correct, whether truly positive or truly negative. It can be derived from sensitivity and specificity using the equation: accuracy = (sensitivity x prevalence) + (specificity x (1 - prevalence)). Wysong et al and Ibrahim et al both provide sensitivity, specificity, and prevalence, so this can be derived. Compare the results in the table below:

TEST ACCURACY

| Study | 40-GEP Class 2A/B vs Class 1 | 40-GEP Class 2B vs Class 2A/1 | BWH High Risk (T3/T2b) vs Low Risk (T2a/T1) |
| --- | --- | --- | --- |
| Wysong et al (PMID: 32344066) | 68% | 85% | 80% |
| Ibrahim et al (PMID: 34821148) | 55% | 85% | 81% |
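As a sanity check, these accuracy figures can be reproduced from the equation. A minimal sketch, using the BWH sensitivity (25%) and specificity (91%) reported in the Wysong paper at that cohort's 16.2% prevalence:

```python
def accuracy(sensitivity, specificity, prevalence):
    """Overall fraction of correct test results at a given prevalence."""
    return sensitivity * prevalence + specificity * (1 - prevalence)

# BWH staging as reported in the Wysong paper:
# 25% sensitivity, 91% specificity, 16.2% metastasis prevalence.
print(round(accuracy(0.25, 0.91, 0.162), 2))  # 0.8, matching the BWH cell above
```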

As you can see, there is an overall slight advantage to the 40-GEP over (or in addition to, in the case of Ibrahim et al) conventional clinical prognostic staging systems when comparing the highest-risk result, Class 2B, to lower-risk 40-GEP results. In plain terms, you would be wrong 1 out of every 6 times if you based your clinical decision making on whether the result was 2B vs 2A/1 (i.e., for all 2B results you recommend sentinel lymph node biopsy, and for all 2A/1 results, observation). This is compared to being wrong 1 out of every 5 times if you staged via BWH criteria and recommended sentinel lymph node biopsy for T2b or higher and observation for T2a or lower. Remember, this works in both directions, meaning 1 in 6 times using the 40-GEP you will recommend either over-treatment or under-treatment.

The accuracy of considering any class 2 result as "high risk" and adjusting treatment accordingly is far worse, resulting in an incorrect recommendation in roughly 1 in 3 to nearly 1 in 2 cases. The improved accuracy of regarding only class 2B results as high-risk comes at a cost in sensitivity, however, with a class 2B = high-risk paradigm providing 19-29% sensitivity and a class 2 (A or B) = high-risk paradigm providing 65-78% sensitivity. You could extrapolate from this that you will be wrong less frequently (1 in 6 vs 1 in 2-3), but when you are wrong you are more likely to under-treat than over-treat.


Interestingly, although sensitivity and specificity do not change with respect to prevalence, accuracy will, as prevalence will impact the relative number of true negatives and positives. When sensitivity and specificity are close, within 10-20% of each other, the impact is minimal. But, as they differ more widely, it has more of an impact. For tests which have a much lower sensitivity relative to specificity, a decrease in prevalence of an outcome will increase the accuracy, as the number of false negatives will decrease much faster than the increase in false positives. The reverse is true when sensitivity is much higher relative to specificity. You can play around with these two calculators to see this in action: calculator 1, calculator 2.
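The direction of this effect is easy to verify numerically. A quick sketch, reusing the 25%/91% sensitivity/specificity pair discussed above and its mirror image:

```python
def accuracy(sens, spec, prev):
    """accuracy = sensitivity x prevalence + specificity x (1 - prevalence)"""
    return sens * prev + spec * (1 - prev)

# Sensitivity much lower than specificity (25% vs 91%):
# accuracy rises as prevalence falls.
for prev in (0.162, 0.05, 0.025):
    print(f"prevalence {prev:.1%}: accuracy {accuracy(0.25, 0.91, prev):.1%}")

# Sensitivity much higher than specificity (91% vs 25%): the reverse.
for prev in (0.162, 0.05, 0.025):
    print(f"prevalence {prev:.1%}: accuracy {accuracy(0.91, 0.25, prev):.1%}")
```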


Accuracy is the statistic of narcissistic doctors, "how often will I be correct?" But for patients, positive and negative predictive value is the perspective that is probably more applicable. The impact of prevalence on PPV and NPV is well known, and more profound than that which is seen on accuracy. See the exhibit in the table below, which shows how the PPV of each prognostic system changes as you change prevalence from that which was seen in the population studied, to that which is more in line with overall population estimates (using rough round numbers 5% and 2.5%, for simplicity).

Positive predictive value based on study population prevalence compared to population estimates (5%, 2.5%)

| Study | 40-GEP Class 2A/B vs Class 1 | 40-GEP Class 2B vs Class 2A/1 | BWH High Risk (T3/T2b) vs Low Risk (T2a/T1) |
| --- | --- | --- | --- |
| Wysong et al (PMID: 32344066) | 29% (10%, 5%) | 60% (29%, 17%) | 35% (13%, 7%) |
| Ibrahim et al (PMID: 34821148) | 24% (8%, 4%) | 52% (24%, 14%) | 34% (13%, 7%) |

Unsurprisingly, the PPV drops dramatically when the prevalence of metastasis drops. It's worth pointing out at this point that the original BWH staging study reported a comparative 84% sensitivity and 85.3% specificity for the cohort in that study when comparing high risk (T3/T2b) and low risk (T2a/T1). This translates to a 38% PPV at the 9.8% prevalence of metastasis seen in that study, decreasing to 23% if applied to a population with a 5% prevalence of metastasis and 13% in a cohort with a 2.5% prevalence of metastasis. This is somewhat more robust than the 40-GEP test with respect to the ability to maintain predictive value across populations with different rates of metastasis. Of note, and I don't know why this discrepancy exists, but if you recalculate the PPV using the original numbers from the BWH staging paper assuming a 16.2% prevalence of metastasis as seen in the Wysong paper, the PPV is 52%, not 35%. Sensitivity and specificity were 84% and 85%, respectively, compared to 25% and 91% for the BWH staging system as reported in the Wysong paper. Similar discrepancies exist in the Ibrahim paper.
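This recalculation is just Bayes' rule applied to the quoted sensitivity/specificity pairs. A minimal sketch:

```python
def ppv(sens, spec, prev):
    """Positive predictive value from sensitivity, specificity, prevalence."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

# BWH staging, sens/spec as reported in the Wysong paper (25%/91%),
# at that cohort's 16.2% prevalence:
print(round(ppv(0.25, 0.91, 0.162), 2))  # 0.35, as in the table above

# BWH staging, sens/spec from the original BWH staging paper (84%/85%),
# applied at the same 16.2% prevalence:
print(round(ppv(0.84, 0.85, 0.162), 2))  # 0.52, illustrating the discrepancy
```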


What is evident is that all of these staging systems have limited PPV, especially when applied to populations that have lower rates of metastasis. It bears mentioning that the reverse is true for negative predictive value, as seen below, especially when the prevalence rate is sufficiently low; although at low prevalences of metastasis, BWH staging alone is already sufficient.

Negative predictive value based on study population metastasis rate compared to population estimate (5%)

| Study | 40-GEP Class 2A/B vs Class 1 | 40-GEP Class 2B vs Class 2A/1 | BWH High Risk (T3/T2b) vs Low Risk (T2a/T1) |
| --- | --- | --- | --- |
| Wysong et al (PMID: 32344066) | 91% (97%) | 87% (96%) | 86% (96%) |
| Ibrahim et al (PMID: 34821148) | 93% (98%) | 87% (96%) | 88% (96%) |
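These NPV figures follow from the same Bayesian bookkeeping. A sketch, again using the BWH sensitivity/specificity (25%/91%) reported in the Wysong paper:

```python
def npv(sens, spec, prev):
    """Negative predictive value from sensitivity, specificity, prevalence."""
    true_neg = spec * (1 - prev)
    false_neg = (1 - sens) * prev
    return true_neg / (true_neg + false_neg)

# BWH staging per the Wysong paper (25% sensitivity, 91% specificity):
print(round(npv(0.25, 0.91, 0.162), 2))  # 0.86 at the study's prevalence
print(round(npv(0.25, 0.91, 0.05), 2))   # 0.96 at a 5% population prevalence
```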

There is an interesting figure in the supplement of Ibrahim et al, supplemental Figure 2. It is interesting in two ways. One, I think visualizing the impact of the 40-GEP on prognosis is most helpful when it is based on current clinical staging criteria: in other words, you stage the patient clinically and then layer on top an additional data point, the 40-GEP, to modify that risk. Two, the 40-GEP doesn't provide adequate monotonicity (increasing stages correlating with worse outcomes) nor homogeneity (similar outcomes within each staging category) to be used as a staging modality on its own.


A large part of the issue is the lack of standardization amongst studies, and thus the limited generalizability of the conclusions and/or data from one cohort to other cohorts, most importantly to the unique populations of patients being managed by any given dermatologist. These factors are myriad and include: how patient data is collected (is tissue from biopsies, excisions, or Mohs debulks, and is it reviewed by a pathologist for specific factors like PNI or LVI, which can be missed on small biopsy samples without considerable depth); tumor location, which varies considerably from study to study and may impact everything from delays in diagnosis to the type of treatment employed to the intrinsic biology of the tumor; treatment modalities, including whether peripheral and deep en face margin assessment (PDEMA), radiation, or other adjuvant therapies were used; and the event rate present in any given cohort, which will significantly impact the positive and negative predictive value of a test if it is utilized in a cohort with a dissimilar event rate.


Accuracy is also an important measure to consider with a test that has implications for both over- and under-diagnosis. In other words, although sensitivity - what percentage of bad outcomes are caught by the test's net - is often heralded, specificity is also incredibly important when invasive tests and potentially morbid treatments may be rendered as the result of a positive test, and when event rates are low relative to non-events. This is reflected in the significantly improved accuracy of the test when a 2B result is used as the "actionable positive event" compared to 2A or 2B: the specificity of the 2B criterion is much higher.


So where does this leave us? With a lot of work to do. It bears emphasizing, now that we have been through all of this, that this is largely food for thought. This isn't a rigorous statistical vetting, but a set of questions I have raised and have yet to see answered sufficiently; and there may be sufficient answers. Perhaps my analysis is just flat out wrong. However, I believe that as an MD with two Masters degrees in clinical research I am likely more capable than the average physician of interpreting the literature. So if I can't do it, thousands of others likely can't either, and we must seek out those who can, and who can explain it in a digestible way for the rest of us.


In an ideal scenario, we would validate these tests in large, randomly selected cohorts across various geographic regions and practice settings. A larger sample size would also allow more subgroup analysis at each BWH (or AJCC8) T stage, so that it is clear what statistical value the test has in that population. It is likely the test will be used differently depending on the pre-test T stage: a likelihood ratio for each test result, with pre- and post-test probabilities, could be useful. We could also benefit from some explanation of the discrepancies in the metastatic rates in these published studies, and whether their cohorts were randomly selected and reflect the population for which the test is intended. This is especially true for conventionally "low-risk" tumors like BWH T1 and T2a.
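The likelihood-ratio approach is straightforward to compute. A sketch, using the BWH sensitivity/specificity from the Wysong paper (25%/91%) purely as an example of the arithmetic:

```python
def post_test_probability(sens, spec, pre_test_prob):
    """Probability of the outcome after a positive result, via the
    positive likelihood ratio LR+ = sensitivity / (1 - specificity)."""
    lr_pos = sens / (1 - spec)
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr_pos
    return post_odds / (1 + post_odds)

# BWH staging per the Wysong paper: LR+ = 0.25 / 0.09, roughly 2.8.
# At a 5% pre-test probability of metastasis:
print(round(post_test_probability(0.25, 0.91, 0.05), 2))  # 0.13
```

As expected, the post-test probability after a positive result equals the PPV at that prevalence; the advantage of reporting likelihood ratios is that each reader can plug in the pre-test probability appropriate to their own patient population.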

