How to compare clinical trial data to a natural history control?

Today I have a hard statistics question I’m posting here in the hopes that someone can help me find an answer.

Context: I am interested in understanding how to analyze data from clinical trials with a natural history control, meaning, a clinical trial where a group of people (say, those with genetic risk for a disease) are recruited, all of them are treated with a drug, and their survival (say, age at disease onset, or death due to this disease) is compared to the survival of untreated historical controls with the same genetic risk.

Before I start, I just want to acknowledge that I’ve read FDA Guidance documents (e.g. Rare Diseases: Common Issues in Drug Development) touching on this topic and I’m well aware that there are many types of bias that can creep into natural history comparisons and that, for this reason, FDA only rarely accepts such trials as evidence of a drug’s efficacy, instead usually considering randomized trials as far stronger evidence. The question of how to determine whether a natural history cohort can fairly be compared to a prospective cohort treated with a drug is an important and complex question. But that is not the topic of my question today. Instead, I am finding that I am stumped on a yet much simpler question, the question of if a natural history cohort is considered comparable and unbiased and so on, then how, statistically speaking, would one do the comparison?

So here’s a scenario.

For simplicity, say my disease has a single, known, genetic cause, and people with this genotype are perfectly healthy until some age, and then they suddenly become very sick. The age of disease onset is highly variable, and hazard varies as a function of age. Everyone in every dataset described below has this disease-causing genotype.
In dataset A, I have data on people in a hypothetical clinical trial. They enrolled at one age (starting_age), were treated with a drug, and were followed a variable amount of time until a second age (last_age) at which they either became sick (event == 1) or withdrew from the trial (event == 0).
In an “ideal ideal” scenario, perhaps the natural history cohort would be a large number of people with this genotype, followed from birth until they either died of this disease or of an unrelated cause. That sort of dataset doesn’t exist. Instead, consider two possible options for what the natural history cohort could be (B and C)…
In dataset B, I have data on people who enrolled in a prospective study at some random age (starting_age), were not treated with any drug, and were simply followed for some variable amount of time until a second age (last_age) at which they either became sick (event == 1) or withdrew from the study (event == 0). The distribution of starting ages and the number of years of followup cannot be assumed to follow the same distributions as in dataset A.
In dataset C, I have purely retrospective data; here there was no prospective followup. We simply observed each person once. Some people were observed to become sick at some particular age (age, and event == 1) while others were seen to still be healthy at some particular age (age, and event == 0).

If I understand the terminology correctly, the data in A and B would be considered left-truncated and right-censored; the data in C would be considered right-censored only. According to terms used in Cain 2011, I believe that dataset B is a “prevalent cohort” while dataset C is an “incident cohort”.

My questions are:

How (if at all) can one construct a survival function (or a hazard function, etc) for dataset A or B, given that the data are left-truncated at different ages for each individual?
What statistical test (or bootstrapping method) would be used to test whether people in dataset A survive significantly longer or have significantly lower hazard than those in B or C?
If all you had was dataset B or C, and you wanted to estimate the statistical power of a given clinical trial with N individuals for a given hazard ratio, how would you go about it?
(Bonus question) if people also had, say, different genetic mutations associated with different hazards as a function of age, how would you incorporate that variable into the answers above?

Some things I’ve tried so far:

I considered Flora’s Z statistic [Flora 1978] which some have applied in similar situations, but I am concerned that it doesn’t account for sampling variance in the natural history cohort being used as a reference.
I Googled a few different combinations of terms such as survival analysis with different left truncation times and found some pages that discuss similar problems, such as Survival Analysis: Left-Truncated Data, Surviving Left Truncation using PROC PHREG, but these did not explain the underlying math of their solution.
I finally came upon Cain 2011 which discusses the issue in detail and, helpfully, has R code for handling left-truncation in the supplement. They implemented their own function for an MLE incorporating left-truncation, but the claim is that left-truncation can already be handled in a Cox proportional hazards model using functionality built into the R survival library, for instance: coxph(Surv(time=agestart,time2=x,event=cx,type='counting')~bmi). Here a Surv object is created using time as the starting age, and time2 as the age where event either did or didn’t happen. That sounds sensible, though the help file for Surv doesn’t explain what it is actually doing here — for example it does not explain what the counting model entails. I turned to the survival manual and, as a reference for counting, got pointed to Andersen & Gill 1982 which in turn was a bit over my head and does not seem to discuss left truncation, or at least not by that name. One peculiarity is the “counting” model seems built to fit cases where an individual can have multiple events in their lifetime, which is not the case in my example, but perhaps that doesn’t matter.
Based on the above, it seemed like a Cox “counting” model might be the right way to compare dataset A to B, and I was able to do such a comparison in R (see code below) though I am still struggling to understand whether I am doing the right thing here, and I am not sure how (if at all) such an approach could be applied to compare dataset A to C.
Finally, I searched Cross Validated for survival left truncation and found a large number of instances where people had asked similar questions as I’m asking (1, 2, 3, 4, 4, 5, 6, 7) but most were unanswered; one pointed to the Cain reference above, and another pointed to Klein & Moeschberger 2003 which was helpful (see e.g. p. 123 and p. 312) and seems to support the notion that A and B can be compared using a fairly simple Cox approach, though it doesn’t address the comparability of A to C, nor the power calculation question (though, if the statistical test question is answered, I could presumably get to the power with some bootstrapping).

Below are some hypothetical data in R illustrating this scenario. I’ve included 20 rows for each dataset, though in case it matters, in the real life scenario I am imagining, datasets A and B could be perhaps on the order of 50 or 100 patients, and dataset C might be on the order of 500 or 1,000 patients.

# A) hypothetical data from clinical trial
data_a = read.table(sep='|',header=T,textConnection("
indiv_id|starting_age|last_age|event
33|42|0
45|49|0
47|52|1
30|34|0
37|44|0
34|37|0
29|34|0
58|66|0
58|60|0
66|75|0
37|41|0
37|46|0
58|62|0
44|48|1
45|50|0
56|65|0
54|63|0
36|41|0
47|55|1
45|55|0
"))

# B) hypothetical data from a prospective natural history study
data_b = read.table(sep='|',header=T,textConnection("
indiv_id|starting_age|last_age|event
19|28|0
39|52|0
38|41|1
18|27|0
20|24|0
16|20|0
39|41|0
48|50|0
40|50|0
38|41|1
40|43|1
26|29|0
37|39|0
21|30|0
36|41|1
46|48|0
27|32|0
26|29|0
29|38|0
47|58|0
"))


# C) hypothetical data from a retrospective natural history study
data_c = read.table(sep='|',header=T,textConnection("
indiv_id|age|event
43|1
53|1
64|1
45|1
88|1
70|1
66|1
55|1
51|1
48|1
63|1
36|0
61|0
63|1
63|1
57|1
74|0
63|1
59|1
57|1
"))


# one possible approach to compare A and B using Cox counting model
data_a$drug = TRUE
data_b$drug = FALSE
nh_compare = rbind(data_a, data_b)
m = survfit(Surv(time=starting_age,time2=last_age,event=event,type='counting')~drug, data=nh_compare)
summary(m)
coxph(Surv(time=starting_age,time2=last_age,event=event,type='counting')~drug, data=nh_compare)

I have posted this question to Cross Validated as well but was not allowed to include more than 2 links, hence the need for this post here. If you have an answer or suggestion, please leave a comment below and/or on the Cross Validated post. Thanks!