StartOut Index Methodology

Methodology for the StartOut Index

The StartOut Index focuses on the impact of high-growth entrepreneurs from historically underrepresented populations. Research has shown that female, Black, and LGBTQ+ founders, among many others, experience systematic barriers to founding new companies. The Index measures both their existing contributions to their home cities as well as the achievement gaps caused by those barriers.

The initial launch in 2020 focused on LGBTQ+ and women entrepreneurs and offered an analysis of interregional inequality.

For the second iteration of the Index, we released our most recent data on LGBTQ+ and women entrepreneurs and began monthly updates.

For this third iteration of the Index, we are releasing a feature to inform policymakers, or anyone else looking to discover which policy changes could be beneficial or detrimental for a particular metropolitan area and state. We used a difference-in-differences model to estimate the causal effect of a particular policy, law, ordinance, or regulation on the funding, jobs, patents, and exits generated by high-growth entrepreneurs.

For a future release, we plan to incorporate more policy-related data, especially economic data, and to develop our algorithms to build a more robust and complete dataset on various ethnic groups, such as South Asian, East Asian, and Hispanic/Latinx.

Threshold for High-Growth Company

For the purposes of this Index, we are interested solely in the founders of high-growth companies. As a matter of practical expediency, we laid out a set of minimum criteria to identify high-growth companies in terms of funding or economic impact. To formalize our criteria we began with two company-level data sources focused on the entrepreneurial economy: Crunchbase and Pitchbook.

These sources are the core dataset of the Index. They contain information about companies’ founders, industry, location, funding history, employment and job creation, and exits (e.g. IPOs and acquisitions) and are updated on a regular basis. The Index further augments its dataset with additional information from Wikipedia, the US Patent and Trademark Office, and the US Census Bureau. To credit patent creation to individual founders, the Index traces patent assignment through inventors to employers (companies in our dataset) and finally onto the founders of those companies.

The Index builds this set of high-growth companies and founders based on meeting one of the following criteria:

  1. received any amount of Venture Capital funding, or 
  2. received at least $250K in Angel funding, or
  3. generated at least one patent and has created jobs beyond the founding team, or
  4. had an IPO or been acquired by another company.

These criteria allow us to be flexible in our definition of a high-growth company, allowing for organizations that have had an impact even in the absence of institutional funding. 
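The four criteria above can be read as a simple predicate over a company record. The following is a minimal sketch, not the Index's actual implementation; the `Company` class and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Company:
    # Hypothetical record; field names are assumptions for illustration.
    vc_funding: float = 0.0          # total Venture Capital funding, USD
    angel_funding: float = 0.0       # total Angel funding, USD
    patents: int = 0                 # patents credited to the company
    jobs_beyond_founders: int = 0    # jobs created beyond the founding team
    had_exit: bool = False           # IPO or acquisition


def is_high_growth(c: Company) -> bool:
    """Return True if the company meets at least one of the four criteria."""
    return (
        c.vc_funding > 0
        or c.angel_funding >= 250_000
        or (c.patents >= 1 and c.jobs_beyond_founders > 0)
        or c.had_exit
    )
```

Because the criteria are joined by "or", a company with a patent and employees qualifies even with no recorded funding.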

At the time of the initial launch, these criteria identified 56,623 high-growth companies in the US. 

As of the second launch, we identified 90,946 high-growth companies in the US.

Entrepreneur Identification and Attribution


The purpose of the Index is to measure how individual high-growth entrepreneurs affect metro economies. It constructs this set of individuals using the dataset of high-growth companies described above. An entrepreneur or founder is any member of the founding team, i.e. anyone present at the company before its first round of funding. If a funding date is not available, founders are identified solely from the Crunchbase and Pitchbook labels.

We chose to include people present at a company before first funding, and not just founders, because we believe anyone who joined the company without a promise of funding (or salary) made a formative, high-risk contribution to the company and its impact.

At the time of launch, these criteria identified 84,516 high-growth entrepreneurs. 

At the time of the second launch, we identified 137,661 high-growth entrepreneurs in the US.

Given a founding team for each company, the Index credits the company’s impact evenly among each of the founders. In other words, for a founding team of 5 individuals, each founder is given credit for 1/5 of the company’s impact (e.g. 1/5 of funding, 1/5 of jobs, 1/5 of patents, and 1/5 of exit value).
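The even split can be illustrated with a short sketch. The data layout (a list of founder lists paired with impact dictionaries) and the function name `credit_founders` are assumptions made for this example, and every founding team is assumed to be non-empty.

```python
from collections import defaultdict


def credit_founders(companies):
    """Split each company's impact evenly among its founders.

    `companies` is a list of (founders, impact) pairs, where `impact`
    maps a measure name (e.g. "jobs") to the company-level total.
    """
    credit = defaultdict(lambda: defaultdict(float))
    for founders, impact in companies:
        share = 1.0 / len(founders)
        for founder in founders:
            for measure, value in impact.items():
                credit[founder][measure] += share * value
    return credit


# Illustrative data only: founder "a" appears on two founding teams.
credit = credit_founders([
    (["a", "b", "c", "d", "e"], {"jobs": 100, "funding": 5_000_000}),
    (["a", "f"], {"jobs": 10, "funding": 0}),
])
```

A founder on several teams accumulates a share from each company, which is how a prolific founder can carry impact in more than one place.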

Having identified all of the high-growth entrepreneurs in its dataset, the Index applies demographic labels to each: race, gender, and sexuality.  To identify LGBTQ+ entrepreneurs, we have integrated three specialized datasets. The first is the membership records of StartOut, the largest LGBTQ+ entrepreneurship group in the world, and Socos’ collaborators in this initiative. The second set of records comes from a data aggregator that provides broad gender and race information to potential employers from 37 publicly available sites. StartOut engaged the aggregator to create a specialized algorithm for identifying LGBTQ individuals in its data. Finally, we have identified a number of public repositories identifying openly LGBTQ+ business leaders, such as Wikipedia.

At the time of launch, these criteria identified 365 LGBTQ+ and 13,218 women high-growth entrepreneurs. 

As of the second launch, we identified 774 LGBTQ+ and 17,791 women high-growth entrepreneurs in the United States.

In preliminary work, we found that certain demographic labels such as gender were fairly easily derived from our existing sources. However, as previous research has shown, many LGBTQ+ entrepreneurs are either closeted or at least not public about their identity. This is easy to understand, given that the very same research has revealed the many barriers historically marginalized entrepreneurs have experienced. In addition to the hidden status of many LGBTQ+ entrepreneurs, StartOut’s data overrepresent the cities in which it maintains active chapters: San Francisco, New York, Los Angeles, Boston, Chicago, and Austin. In order to correct for this overrepresentation and known undercounting of LGBTQ entrepreneurs, we have also developed a statistical model to estimate counts over our outcome variables of interest.

The StartOut Index employs a non-parametric statistical estimate of the undercount of founders. It estimates a distribution over the likely number of hidden LGBTQ+ entrepreneurs in each Metro. This distribution is derived from three variables:

  1. the Metro’s local LGBTQ+ population, pop_lgbtq (from UCLA’s Williams Institute),
  2. the Metro’s entrepreneurship rate, e_rate (from the Index’s aggregated data), and
  3. a correction factor for LGBTQ+ entrepreneurship rates (from previous StartOut research),

where i is an index over individual Metros.

The first two variables, pop_lgbtq and e_rate, give an estimated population of LGBTQ+ entrepreneurs that assumes that these two variables are independent of one another. However, we know that for most populations of interest, entrepreneurship rates differ meaningfully from those of the general population.

So the final variable, the correction factor for LGBTQ+ entrepreneurship rates, corrects for this dependence. The StartOut Index estimates this modifier from the joint rate of LGBTQ+ entrepreneurship (from the Index’s internal data) modified by state-level estimates of LGBTQ+ entrepreneurship rates reported in StartOut’s whitepaper “The State of LGBT Entrepreneurship.” From this estimate we find that the rate of LGBTQ+ entrepreneurship is roughly half that of straight populations.

That gives us the Metro’s population of LGBTQ+ entrepreneurs

Equation for a Metro's Population of LGBTQ Entrepreneurs
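As a rough illustration of how the three variables might combine, the sketch below multiplies them to get a single point estimate. The product form is our reading of the text rather than a confirmed formula, the function and `correction` parameter names are hypothetical, and the Index itself estimates a full distribution rather than one number.

```python
def estimate_lgbtq_entrepreneurs(pop_lgbtq, e_rate, correction):
    """Point estimate of a Metro's LGBTQ+ entrepreneur population.

    pop_lgbtq:  local LGBTQ+ population of the Metro
    e_rate:     the Metro's overall entrepreneurship rate (as a fraction)
    correction: LGBTQ+ entrepreneurship-rate correction factor
    """
    return pop_lgbtq * e_rate * correction


# Illustrative numbers only, not values from the Index.
estimate = estimate_lgbtq_entrepreneurs(
    pop_lgbtq=200_000, e_rate=0.002, correction=0.5
)
```

With a correction factor of about one half, the estimate is about half of what the independence assumption alone would give.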

All identities of entrepreneurs in the system, LGBTQ+ or otherwise, are anonymized for publication and used for no other purpose than producing the Index. 


At the time of the second launch, we have identified 124,756 high-growth entrepreneurs in the US.

Entrepreneurship Equity Score

The Entrepreneurship Equity Score (EES) is a score for each Metro, ranging from 0 to 100. By combining independent measures of innovation, job creation, and economic activity, this score reveals the impact of high-growth entrepreneurs on the Metro over a fixed period of time. The EES combines four sub-factors, each computed independently from explicit counts and statistical models.


The core entity of the Index is the Metro, derived from the Census Bureau’s definition of metropolitan statistical area. Metros represent large integrated economic regions. For example, the entirety of the San Francisco Bay Area (Oakland, San Jose, Marin County, and more) is assigned to the San Francisco Metro. The impact of individual entrepreneurs and founders on a Metro is measured in terms of the location of the headquarters of companies they have founded. This means that some prolific founders have had impacts across multiple Metros. This also means that impact measures such as jobs and patents are treated more as indices than explicit economic activity, as they are credited back to the founding Metro region even if the jobs actually exist in a separate city. Lastly, to be considered qualified, a Metro must meet the minimum threshold of 50 qualified entrepreneurs.


At the time of launch, these criteria identified 77 Metros in the United States. 

At the time of the second launch, 118 Metros in the United States have been identified as qualified.


As an initial measure of direct economic impact, the Index tracks venture funding and angel investments, as well as other forms of risk capital. The timing, amount, and nature of these investments are derived from the Pitchbook and Crunchbase sources.

The Index applies the same nonparametric estimation algorithm and Metro-based normalization that is used for jobs. This results in the following sub-component

Funding Score Normalization Formula


“Jobs” is a measure of the number of jobs created by entrepreneurs of our target population in the Metro region within the specified time period. To arrive at this number the StartOut Index records the total number of jobs created by each company located within the Metro. Then, as described above, it credits those jobs to the individual founders of the company. For each founder that is a member of the current target population, the Index adds their share of job creation to its aggregate jobs measure. This provides a total count of jobs created by the target entrepreneurial population.

Because of the issue of undercounting described above, our count of jobs is extended by a non-parametric estimate of the distribution of likely job creation by those undercounted entrepreneurs. To compute this distribution, we sample from the likely number of undercounted entrepreneurs from our target populations. For example, we might have estimated that there’s a 40% probability that 5 of a given Metro’s 200 funded entrepreneurs are likely LGBTQ+ and a 60% probability that 4 of them are LGBTQ+. A distribution over likely job creation is computed as follows:

  1. Randomly select 5 individuals out of the 200.
  2. Count the number of jobs that those 5 individuals created.
  3. Add that number to the set of possible jobs created.
  4. Randomly select 5 new individuals (with replacement) out of the 200.
  5. Count the number of jobs they created.
  6. Add that number to the set of possible jobs created.
  7. Repeat this process until a stable distribution over possible jobs emerges.
  8. Then randomly select 4 individuals out of the 200.
  9. Again, repeat this process using 4 individuals until a new distribution emerges.
  10. Produce a final distribution of likely job creation by weighting the original two distributions by their initial probabilities.

This gives the Index its non-parametric estimated distribution of likely job creation by LGBTQ+ entrepreneurs. If few entrepreneurs in a local population produce jobs then the bulk of this distribution will be 0 additional jobs created. If many entrepreneurs were highly productive, then the distribution accordingly is more likely to include larger job creation values. In either case, it allows us to compute a 99% confidence interval over the likely number of jobs our undercounted population generated. For the sake of Index visualization, this distribution is simplified to its mean, a single value that is added to the job creation sub-factor.
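The resampling procedure described above can be sketched as a small Monte Carlo routine. This is an illustrative approximation, not the Index's code: the function names are hypothetical, and a fixed number of draws stands in for "repeat until a stable distribution emerges".

```python
import random


def sample_job_totals(jobs_per_founder, k, n_draws, rng):
    """Draw `n_draws` random groups of k founders and total their jobs."""
    return [sum(rng.sample(jobs_per_founder, k)) for _ in range(n_draws)]


def estimate_hidden_jobs(jobs_per_founder, count_probs, n_draws=5_000, seed=0):
    """Mix the per-count distributions, weighted by each count's probability.

    count_probs maps a possible number of undercounted entrepreneurs to
    its probability, e.g. {5: 0.4, 4: 0.6}.
    Returns (mean, weighted_draws), where weighted_draws is a list of
    (jobs_total, weight) pairs describing the final distribution.
    """
    rng = random.Random(seed)
    weighted = []
    for k, prob in count_probs.items():
        draws = sample_job_totals(jobs_per_founder, k, n_draws, rng)
        weighted.extend((total, prob / n_draws) for total in draws)
    mean = sum(total * w for total, w in weighted)
    return mean, weighted


# Illustrative Metro: 200 founders, 50 of whom created 10 jobs each.
jobs = [0] * 150 + [10] * 50
mean_jobs, distribution = estimate_hidden_jobs(jobs, {5: 0.4, 4: 0.6})
```

The returned mean is the single value that would be added to the directly counted jobs; the weighted draws could also be sorted to read off an interval.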

And so the final jobs sub-factor

Jobs Subfactor Variable Term

represents a combination of directly counted job creation and statistical estimate.

The job sub-factor is then normalized to provide a final jobs score

Normalization of Subfactor Formula

From the aggregate variable

Aggregate Variable Term

we also compute a mean job score,

Mean Jobs Score Variable
Aggregate Jobs Score Formula

The aggregate job score gives the total impact of LGBTQ+ entrepreneurs on job creation within a given Metro. The mean, by comparison, gives us an idea of the individual contributions and challenges of LGBTQ+ entrepreneurs on a one-to-one basis with their straight peers.


To measure innovation, the Index relies on counts of patents created by each company. (We plan to expand to include research publications and data on media creation in the future.)

Again, the Index applies the nonparametric estimation and Metro-based normalization. This results in the sub-component

Normalization of Patents Score Formula


This score measures the total value of all acquisitions and IPOs of companies founded by target population entrepreneurs in a given Metro.

The nonparametric estimation and Metro-based normalization are applied, resulting in the final sub-component

Normalization of Exit Score Formula

Impact Size

The Index produces two main components: the Index Score and the Impact Size. The Score is a measure of the achievement gap for the given Metro region and will be discussed further below. The Impact Size is an absolute measure of the economic impact of our population of interest, i.e. LGBTQ+ entrepreneurs. 

The Impact Size is computed by adding the four normalized sub-factors together to give a measure of the “economy” for LGBTQ+ entrepreneurs

Impact Size Formula

Similarly we compute a mean impact size to reflect the average impact of individual LGBTQ+ entrepreneurs in a given metro

Mean Impact Size Formula

Achievement Gap

There is substantial existing research literature on wage gap, funding gap, and other barriers for underrepresented entrepreneurs. Here we seek to understand the scale of what individual Metros could achieve if those barriers were reduced. In the StartOut Index, this achievable prediction is represented by the purple bar. Rather than just imagine that LGBTQ+ or female entrepreneurs behave identically to their straight and male peers, or that cultural barriers can be immediately removed, the Index instead uses a model of “best-in-class” performance for each individual measure. The achievable represents all of the jobs or patents, for example, that could have been. The achievement gap is the difference between the Index’s measurements and its predictions.

For each mean sub-factor, such as 

Mean Jobs Score Variable Term

the Index computes a best-in-class performance. The Index considers each Metro and removes any Metro with three or fewer members of the population of interest. Of the remaining Metros, it then removes all entrepreneurs whose performance is 2.5 standard deviations above or below that Metro’s mean for that sub-factor. The Index recomputes the mean of the sub-factor after removing those Metros and entrepreneurial outliers and also computes an outlier-corrected sub-factor mean for non-LGBTQ+ entrepreneurs.

The three best performing Metros in terms of these sub-factor means are identified. For each Metro, a ratio of LGBTQ+ to non-LGBTQ+ sub-factor means is computed and averaged

Relative Productivity of LGBTQ+ to non-LGBTQ+ Entrepreneurs Formula

where j indexes the top 3 cities.

This ratio,

Relative Productivity of LGBTQ to non-LGBTQ Entrepreneurs Variable Term

is the relative productivity of LGBTQ+ to non-LGBTQ+ entrepreneurs for each sub-factor. It equals 1 if LGBTQ+ founded companies are as productive as non-LGBTQ+ companies for that specific factor. If it is 0.5, they perform half as well. Our model of achievement gap assumes that local LGBTQ+ populations can meet this same best-in-class productivity ratio in their city.
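The best-in-class ratio can be sketched as follows. Several details are assumptions for illustration: the function names, ranking the "best performing" Metros by the target population's trimmed mean, the use of the population standard deviation for the outlier cut, and positive comparison means.

```python
from statistics import mean, pstdev


def trimmed_mean(values, z=2.5):
    """Mean after dropping values more than z standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    kept = [v for v in values if sigma == 0 or abs(v - mu) <= z * sigma]
    return mean(kept)


def bic_ratio(metros, min_members=4, top_n=3):
    """Average target/comparison ratio over the top_n Metros.

    `metros` maps a Metro name to (target_values, comparison_values),
    the per-entrepreneur values of one sub-factor for each group.
    """
    means = {
        name: (trimmed_mean(target), trimmed_mean(comparison))
        for name, (target, comparison) in metros.items()
        if len(target) >= min_members  # drop Metros with three or fewer members
    }
    best = sorted(means.values(), key=lambda pair: pair[0], reverse=True)[:top_n]
    return mean(t / c for t, c in best)


# Illustrative data only.
example = {
    "A": ([10, 10, 10, 10], [20, 20, 20, 20]),
    "B": ([8, 8, 8, 8], [8, 8, 8, 8]),
    "C": ([6, 6, 6, 6], [12, 12, 12, 12]),
    "D": ([4, 4, 4, 4], [4, 4, 4, 4]),
    "E": ([100, 100, 100], [1, 1, 1]),  # too few members, excluded
}
ratio = bic_ratio(example)
```

Here Metros A, B, and C are the top three, so the ratio averages 0.5, 1.0, and 0.5.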

That BiC ratio is important because we can’t just take the average job creation rate from the top performing city and call it a day. Different cities are different, from industries to workforce profiles to access to funding. Instead, the Index applies the BiC jobs ratio to the average job creation rate by traditional entrepreneurs in a given Metro, implicitly accounting for everything unique about that city (entrepreneurship rates, access to capital, number of universities, and more) and only adjusting for the relative performance of the target population. The Index assumes every city can achieve this BiC performance ratio.

The Index estimates the achievable for a specific Metro by applying the best-in-class ratios to the sub-factors computed for comparison entrepreneurs. For example, job creation in a manufacturing-heavy Metro might be higher on average than in a FinTech-focused Metro. Applying the ratio to real local productivity respects these differences. For example, an idealized mean jobs sub-factor assumes that target population entrepreneurs in every Metro can achieve the same productivity ratio

Potential Jobs Creation Rate for Entrepreneurs Formula


Potential Job Creation Rate for individual Local Target Population Entrepreneurs Variable Term

is the potential job creation rate for individual local target population entrepreneurs.

The BiC model makes one additional assumption around entrepreneurship itself.  As previous research has revealed, women and LGBTQ+ individuals historically become entrepreneurs at a lower rate than men and straight individuals. Just as with job creation, this rate varies wildly by Metro. The Index computes a BiC entrepreneurship ratio from the top three cities for the target population.

To compute the aggregate achievable for an entire Metro, we take the statistically inferred population described above

Statistically Inferred Population Formula

and multiply it by our best-in-class model for each sub-factor. (Note that the LGBTQ+ entrepreneurship rate correction factor here is also computed using the same best-in-class methodology.) The achievable represents the full potential contribution of LGBTQ+ entrepreneurs for each sub-factor, for example

Jobs Achievable Formula

From that achievable estimate, the Index can then compute the achievement gap, the difference between the measured impact of local entrepreneurs and the BiC model of what that impact could be

Achievement Gap Formula

Each sub-factor has its own gap; these gaps are combined to produce the full achievement gap for each Metro

Full Achievement Gap for Metro Formula

The achievement gap is finally normalized to range between 0 and 100, becoming the Index Score reported by the Index.

Industry Fingerprint

To begin to explore the causes behind the achievement gap, the Index analyses likely factors that set the context for entrepreneurial success. One particularly relevant factor is the relative composition of industries within a given Metro region. The Index computes an industry fingerprint for each Metro to provide a visual summary of inclusion at the industry level.

To create this fingerprint, the Index uses the Global Industry Classification Standard (GICS) to count the number of companies in each industry. (Individual companies were allowed to be counted in more than one industry.) These counts were then normalized by the total number of companies in a given Metro, giving a proportion of industry in the Metro economy. Industry proportions were also computed at a national level. Finally, for each industry in each Metro, a likelihood ratio was computed by dividing the Metro proportion for that industry by the national proportion. This ratio indicates whether a given industry is over- or under-represented in a local economy compared to the nation at large.
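The proportion and likelihood-ratio steps can be sketched briefly. The data layout (each company as a list of industry labels) and the function names are assumptions for illustration, and the labels below are placeholders rather than a full GICS taxonomy.

```python
from collections import Counter


def industry_proportions(companies):
    """Share of companies in each industry.

    `companies` is a list of industry-label lists; a company may carry
    more than one label, so proportions can sum to more than 1.
    """
    counts = Counter(label for labels in companies for label in set(labels))
    return {label: n / len(companies) for label, n in counts.items()}


def likelihood_ratios(metro_companies, national_companies):
    """Metro proportion divided by national proportion, per industry."""
    metro = industry_proportions(metro_companies)
    national = industry_proportions(national_companies)
    return {
        label: metro.get(label, 0.0) / share
        for label, share in national.items()
    }


# Illustrative data: the Metro's two companies are part of the national set.
metro = [["Software", "Financials"], ["Financials"]]
nation = metro + [["Software"], ["Health Care"]]
ratios = likelihood_ratios(metro, nation)
```

A ratio above 1 marks an industry that is over-represented locally, and a ratio below 1 marks one that is under-represented.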

For each Metro, the Index computes an industry weight as

Industry Weight Formula

where i indexes Metros.

The Index computes an industry gap as a weighted average of the individual impact scores by the industry weights

Industry Gap Formula

The industry fingerprint then represents the size of the industry across all entrepreneurs within the Metro (target and comparison) as well as the weighted average of the achievement gap across the target population for the entire nation. This offers some insight into why a city might be particularly challenging or successful for a given target population.


Policy v1.0

Non-Policy Data

We collected non-policy data predominantly from the Census Bureau consisting of 200 or more different sociological, demographic, and economic features for every state and qualifying metro in our selected years of 2010 and 2020.  Economic data were collected from FRED, BLS, and BEA.  When data was unavailable for 2010 and 2020, we substituted data from the previous or subsequent years.

Data included age, gender, race, ancestry, place of birth, marital status, household and families, occupancy characteristics, children characteristics, means of transportation, commuting characteristics, educational attainment, school enrollment, nativity and citizenship status, language spoken at home, income and earnings, employment status, occupation, industry, poverty status, veteran status, voting status, as well as other social, economic, housing, and financial characteristics.

We attempted to collect data on every state and territory in the United States, including DC, Puerto Rico, and Guam. For the qualifying metros, we first matched the metros to existing Combined Statistical Areas (CSAs) and then matched any remaining metros to Metropolitan or Micropolitan Statistical Areas (MSA/μSA). One notable exception is that we split the Washington DC CSA back into the Washington DC and Baltimore MSAs, since they are very distinct entrepreneurial spaces. There are currently 119 qualifying metros: 63 of them were in CSAs in 2010, while 87 were in CSAs in 2020.

We relied on the Williams Institute as our resource for demographic data on the LGBT community at the state and metro levels, including racial makeup by state and metro area, age distribution, and socioeconomic indicators. We use these data to calculate our estimated numbers of LGBTQ+ entrepreneurs in a given metro or state based on the top 3 best-in-class in the respective geographic region.


Policy Data

Our policy data were collected from the Movement Advancement Project (MAP) and Fraser Institute Economic Freedom Rankings.  The former covers laws and policies in the following categories on the basis of sexual orientation and gender identity: Relationship and Parental Recognition, Nondiscrimination, Religious Exemption, LGBTQ Youth, Healthcare, Criminal Justice, and ID Documents.  The latter are more economic in nature and involve metrics in the following categories: Government Spending, Taxes, Labor Market Freedom, Legal System and Property Rights, Sound Money, and Freedom to Trade Internationally.  

For the policy data, when considering individual policies, laws, ordinances, and regulations, we used the year of enactment to set the treatment dummy variable for 2010 and 2020: “0” for untreated or not enacted (control group) and “1” for treated or enacted (treatment group). For most policies, the metros inherited the policy data from the state, but some policies differ notably between metro-level ordinances and state-level statutes (or the lack thereof), e.g. Nondiscrimination Statutes and Ordinances (Housing, Public Accommodations, Employment) and Conversion Therapy Bans.

When considering a feature with a continuous range of values, such as an index, score, ranking, or tally (for example, those in the Fraser dataset, which are often composite measures of several similar policies), we determined a threshold for each model based on the distribution of the values. A state or metro above the threshold was given a “1” (treatment group), and one below it was given a “0” (control group).
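The two ways of assigning the treatment dummy can be sketched as below. The function names are hypothetical, treating a policy enacted in the reference year itself as "enacted" is an assumption, and the median is used as a default threshold purely for illustration since the text only says the threshold was chosen from the distribution of values.

```python
from statistics import median


def enacted_dummy(enactment_year, as_of_year):
    """1 if the policy was enacted on or before as_of_year, else 0.

    `enactment_year` is None when the policy was never enacted.
    """
    return int(enactment_year is not None and enactment_year <= as_of_year)


def threshold_dummies(scores, threshold=None):
    """Binarize a continuous score: 1 above the threshold, 0 otherwise.

    The median default is an illustrative assumption.
    """
    if threshold is None:
        threshold = median(scores.values())
    return {unit: int(value > threshold) for unit, value in scores.items()}


# Illustrative scores for three hypothetical states.
dummies = threshold_dummies({"A": 7.5, "B": 6.0, "C": 5.0})
```

A policy enacted in 2015 would therefore be coded 0 for 2010 and 1 for 2020.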


Target features

To calculate our values for the target features, we had to create two daughter files representing the two time periods in our observational study, i.e. 2000-2010 and 2011-2020.   Since we have access to the full scope of Crunchbase datasets, there is a lot more preprocessing and filtering involved.  For Pitchbook, we have requested only to receive companies receiving angel and venture capital funding, which automatically qualifies most if not all Pitchbook companies.


Mean Funding

In determining the 2000-2010 values for the target variables, we eliminated any companies with a founding date after 2010. For the remaining Pitchbook companies, we were provided with the individual funding events up through 2010, but we used the aggregate sum after adjusting for inflation. We performed deduplication on the Crunchbase and Pitchbook datasets. For any Crunchbase company not listed in Pitchbook, we included companies with any funding events through 2010, filtered out any events after 2010, and used the date listed for each event to adjust for inflation by year before calculating the aggregate sum for each company.

In determining the 2010-2020 numbers for funding, we subtracted the 2010 inflation-adjusted total funding values from the 2020 total funding values for any Pitchbook company.  For Crunchbase companies, we looked at any funding events starting 2011 and adjusted for inflation based on the associated date, and calculated the aggregate sum by company.  

We aggregated the data by metro and state to calculate the mean funding raised by the companies within our various geographic areas.


Mean Jobs

Only Pitchbook provides cumulative job numbers at least once per year and their associated release dates.  We were provided the cumulative employee count as of 2010.  For the 2020 job numbers, we used the highest cumulative job number up through 2020 from our latest Pitchbook dataset.  To calculate the job totals for the 2010-2020 dataset, we subtracted the 2010 job numbers from the 2020 job numbers for any companies in common between the two datasets.

We aggregated the data by metro and state to calculate the mean jobs created by the geographic area’s companies.


Mean Exits

We gathered the exit-related columns from Crunchbase and Pitchbook to create our algorithm. We searched for any IPO or acquisition event listed with its respective date. Companies with an IPO or acquisition reported before 2011 were included in the 2000-2010 dataset and assigned 1 for the exits feature, and companies with an IPO or acquisition reported after 2010 were included in the 2010-2020 dataset and assigned 1 for the exits feature.

We calculated the mean values of the 0’s and 1’s for every metro and state to derive a number between 0 and 1, representing the probability of an exit within that geographic area.


Entrepreneurship Rate

We summed the counted or estimated number of founders for each metro and state. We also calculated the scalar population per 100K for each metro and state. The entrepreneurial rate is the sum of the founders divided by population per 100K:

$$\text{ent\_rate} = \frac{\text{founder\_count}}{\text{pop\_100k}}$$

Criteria for Qualified Companies

For the qualification process, we applied the same criteria to the 2000-2010 and 2010-2020 datasets for a company to be included as a “high-growth” company in our calculation of the target variables:

  1. raised at least $250,000 in funding (angel or VC), although any amount of VC funding qualifies,
  2. had a successful exit event (IPO or acquisition), or
  3. generated at least one patent and created jobs beyond the founding team.

We did not apply this final criterion, as we did not have job numbers from Crunchbase and we are in the midst of revamping the algorithm for scraping the USPTO website.

Background and Prior Modeling

We had originally calculated outputs for each of our target populations (All, Women, and LGBTQ+ Entrepreneurs) for every state and qualifying metro, but due to the large amount of undocumented data from 2010, especially for LGBTQ+ entrepreneurs, it was impossible to produce reliable estimates of funding, jobs, patents, and exits for each target population. Working with data on all entrepreneurs gives us a more robust dataset. In the spirit of finding inclusive policies that help everybody, we care about how our policy data affect the output of the state, not how the data affect female or LGBTQ+ entrepreneurs.

We also developed a model using logarithmic transformations of each of the target variables (log_funding, log_jobs, log_exits, log_patents), since the transformed data followed a much more Gaussian distribution, allowing for better fitting in our DiD models. However, when producing counterfactuals through those models, our estimates were values in logarithmic space, which then had to be exponentiated back into linear space to produce a human-readable value. Passing our outputs into logarithmic space to produce a counterfactual and then exponentiating them back into linear space compounded the error, especially for the larger states and metros, and produced unrealistically large estimates.

In order to mitigate this compounding effect, we eliminated using the total aggregate values of funding, jobs, and exits from our target features due to their logarithmic distribution.  Using totals conflated two things: counts of entrepreneurs and actual units of interest, so we broke the totals back down to their means and founder_count.


Current Manifestation

Mean funding, jobs, patents, and exits (funding_mean, jobs_mean, patents_mean, exits_mean) are our chosen target features for training the model, since the mean values followed a more Gaussian distribution. Since straight counts are logarithmically distributed, we created a new feature for modeling called entrepreneurship rate (ent_rate), which was calculated by dividing the counted or estimated entrepreneurs in a metro or state by population per 100,000. We also created a new scalar feature, population per 100,000 (pop_100k), for each metro and state, which allows us to calculate total funding and jobs.


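Our total funding and job estimates are calculated from the following equations:

$$\text{total\_funding} = \text{funding\_mean} \times \text{ent\_rate} \times \text{pop\_100k}$$

$$\text{total\_jobs} = \text{jobs\_mean} \times \text{ent\_rate} \times \text{pop\_100k}$$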

Multiplying the entrepreneurship rate by the population per 100,000 gives us the total number of founders in a given metro or state.  Multiplying the mean funding by the total number of founders then gives us the total funding raised by all entrepreneurs in a given metro or state.


Our new founder estimates are calculated by multiplying the entrepreneurship rate by the scalar value of population per 100,000:

$$\text{founder\_count} = \text{ent\_rate} \times \text{pop\_100k}$$

Our exit probability estimates come directly from our estimates for exits_mean, which is a number that falls between 0 and 1. We take that decimal value as the probability of an exit.
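The arithmetic described in this section is simple enough to sketch directly. The function names are hypothetical, and the numbers in the usage lines are illustrative rather than values from the Index.

```python
def total_founders(ent_rate, pop_100k):
    """Founders in a region: rate per 100,000 residents times population in 100,000s."""
    return ent_rate * pop_100k


def total_funding(funding_mean, ent_rate, pop_100k):
    """Total funding: mean funding times the total number of founders."""
    return funding_mean * total_founders(ent_rate, pop_100k)


def exit_probability(exit_flags):
    """Mean of 0/1 exit indicators, read as the probability of an exit."""
    return sum(exit_flags) / len(exit_flags)


# Illustrative region: 12 founders per 100,000 residents, 4.5 million residents.
founders = total_founders(ent_rate=12, pop_100k=45)
funding = total_funding(funding_mean=2_000_000, ent_rate=12, pop_100k=45)
p_exit = exit_probability([1, 0, 0, 1])
```

Modeling the rate and the mean separately, and multiplying only at the end, keeps the modeled quantities closer to Gaussian while still recovering totals.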

Sparse Principal Component Analysis Model

To account for confounding variables, we performed a Sparse Principal Component Analysis on the 200-plus non-policy features using sklearn.decomposition.SparsePCA() from the scikit-learn package to output either 5 or 6 principal components, which we subsequently used as covariates for the policy data in the DiD model. We picked sparse PCA over traditional PCA for its interpretability and feature selection. The sparsity constraint encourages a smaller number of non-zero coefficients in the principal components, allowing for a more concise and meaningful representation of the underlying features. It also automatically highlights the most relevant features and discards the less important ones.

We created an algorithm to determine whether each feature exhibited a Gaussian distribution. For features that did not follow a normal distribution, a transformation (logarithmic, square root, reciprocal, exponential, etc.) was applied until a more Gaussian distribution was achieved. Any feature that did not follow a normal distribution even after all the transformations had been applied was eliminated. Subsequently, the following preprocessing tasks were implemented:

  1. Imputed any missing values with the median value using SimpleImputer().
  2. Removed the mean value and scaled to unit variance with StandardScaler() so that the values are all centered around zero.
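To make the two preprocessing steps concrete, here is a minimal plain-Python sketch of the same arithmetic for a single feature column. It is not the Index's pipeline, which uses the scikit-learn classes named above; the function name and the sample column are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import fmean, median, pstdev


def impute_and_standardize(column):
    """Median-impute missing values (None), then center and scale one column."""
    observed = [v for v in column if v is not None]
    fill_value = median(observed)  # median of the observed values only
    filled = [fill_value if v is None else v for v in column]

    mu = fmean(filled)
    sigma = pstdev(filled)  # population standard deviation, as StandardScaler uses
    if sigma == 0:
        raise ValueError("constant column cannot be scaled to unit variance")
    return [(v - mu) / sigma for v in filled]


# Hypothetical feature column with one missing value.
scaled = impute_and_standardize([1.0, None, 3.0, 5.0, 3.0])
```

After this step the column has mean zero and unit variance, and the imputed entry sits exactly at the center because the median and the mean happen to coincide in this example.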

Difference-in-Differences Model

The DiD model compares the difference in outcomes pre- (time=0) and post-treatment (time=1) between a treatment (treatment=1) and a control (treatment=0) group. The treatment group receives the treatment or intervention, while the control group does not receive it and serves as a reference for comparison. The key assumption of the DiD model is the parallel trends assumption, which states that, in the absence of treatment, the outcomes for the two groups would follow parallel paths over time. This assumption implies that any difference in the change in outcomes between the groups post-treatment can be attributed to the treatment effect itself. By comparing the changes in outcomes over time, the DiD model helps to isolate the causal effect of the treatment from other confounding factors that may affect the outcomes.


The DiD (or “double difference”) estimator is defined as the difference in the mean outcome in the treatment group pre- and post-treatment minus the difference in the mean outcome in the control group pre- and post-treatment:

$$\text{DiD} = (\bar{Y}_{T,1} - \bar{Y}_{T,0}) - (\bar{Y}_{C,1} - \bar{Y}_{C,0})$$

where T indicates treatment group, C indicates control group, 0 indicates pre-treatment, 1 indicates post-treatment, and Y is the outcome.
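The double difference can be computed directly from the four group means. The sketch below is illustrative only: the function name, the row layout of (treatment, time, outcome) tuples, and the numbers are hypothetical.

```python
from statistics import fmean


def did_estimate(rows):
    """Double-difference estimate from (treatment, time, outcome) rows.

    treatment and time are 0/1 dummies; rows must be a list so it can be
    scanned once per cell.
    """
    def cell_mean(treatment, time):
        return fmean(y for g, t, y in rows if g == treatment and t == time)

    treated_change = cell_mean(1, 1) - cell_mean(1, 0)
    control_change = cell_mean(0, 1) - cell_mean(0, 0)
    return treated_change - control_change


# Hypothetical outcomes: two treated and two control units, observed pre and post.
rows = [
    (1, 0, 10.0), (1, 0, 12.0), (1, 1, 18.0), (1, 1, 20.0),
    (0, 0, 8.0), (0, 0, 10.0), (0, 1, 11.0), (0, 1, 13.0),
]
did = did_estimate(rows)
```

Here the treated group rises by 8 on average and the control group by 3, so the estimate attributes the remaining 5 to the treatment, provided the parallel trends assumption holds.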


The general equation for the DiD model is the following:

$$Y = \alpha + \beta \cdot \text{Treatment} + \gamma \cdot \text{Time} + \delta \cdot (\text{Treatment} \times \text{Time}) + \varepsilon$$

where Y is the target variable, Treatment is a dummy variable indicating whether the unit is in the treatment group (1) or control group (0), Time is a dummy variable indicating the time period (pre-treatment: 0, post-treatment: 1), Treatment × Time is the interaction term between Treatment and Time, α, β, γ, and δ are the coefficients to be estimated, and ε is the error term.

For individual policies, we assigned the values for the dummy variables using the following guidelines:


  • Treatment = 1 for any state/metro that had a year of enactment of 2020 or earlier.
  • Treatment = 0 for any state/metro that did not have a year of enactment of 2020 or earlier.
  • Pretreated = 1 for any state/metro with a year of enactment of 2010 or earlier.
  • Pretreated = 0 for any state/metro with a year of enactment after 2010.
  • Time = 0 for the data compiled for 2010.
  • Time = 1 for the data compiled for 2020.
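The guidelines above translate into a small lookup. In this sketch the function names are hypothetical, and a state or metro that never enacted the policy is assumed to be represented by an enactment year of None.

```python
def policy_dummies(enactment_year):
    """Return the Treatment and Pretreated dummies for one state/metro.

    enactment_year -- year the policy was enacted, or None if never enacted
                      (the None convention is an assumption of this sketch).
    """
    enacted_by_2020 = enactment_year is not None and enactment_year <= 2020
    enacted_by_2010 = enactment_year is not None and enactment_year <= 2010
    return {"treatment": int(enacted_by_2020), "pretreated": int(enacted_by_2010)}


def time_dummy(data_year):
    """Return the Time dummy: 0 for the 2010 data, 1 for the 2020 data."""
    return {2010: 0, 2020: 1}[data_year]
```

For example, a policy enacted in 2015 yields Treatment = 1 and Pretreated = 0, while one enacted in 2008 yields 1 for both.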

We used OLS in statsmodels to estimate the coefficients of the DiD model, where the coefficient of interest is δ, which represents the treatment effect. After running the DiD model for each individual policy with each of our target variables (funding_mean, jobs_mean, exits_mean, ent_rate), we gathered certain metrics from each run of the linear regression and calculated a “meaningfulness” score from the product of the negative log of the p-value for the treatment-time interaction term (which accounts for significance) and δ, the coefficient of the treatment-time interaction term (which accounts for effect size):

$$\text{meaningfulness} = -\log(p) \times \delta$$
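A minimal sketch of the score follows. The text does not state the base of the logarithm, so the natural log is assumed here; the function name and the input values are hypothetical.

```python
import math


def meaningfulness(delta, p_value):
    """Score = -log(p) * delta for the treatment-time interaction term.

    Assumes the natural logarithm and a p-value strictly between 0 and 1.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -math.log(p_value) * delta


# Hypothetical regression output: effect size 2.0 with p = 0.01.
score = meaningfulness(delta=2.0, p_value=0.01)
```

Because the negative log of a p-value is never negative, the score keeps the sign of δ: policies with a negative estimated effect receive a negative score, and less significant results are pulled toward zero.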

funding_total and jobs_total are each derived from the product of two of our target parameters: funding_mean or jobs_mean, and ent_rate. To calculate the p-value for these products, we first calculated the standard error of the product of the two parameters using the following equation:

and we calculated the covariance using the following equation:

To calculate the p-value from the standard error, we apply the survival function, scipy.stats.norm.sf(), after calculating the z-score from the following equation:
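The equations for this step are not reproduced in the text, so the sketch below should be read as one common approach rather than the Index's exact calculation. It assumes the first-order (delta-method) approximation for the standard error of a product, a z-score equal to the estimate divided by its standard error, and a two-sided test. The normal survival function is written with math.erfc, which is mathematically equivalent to scipy.stats.norm.sf(). All names and numbers are hypothetical.

```python
import math


def product_standard_error(a, b, se_a, se_b, cov_ab=0.0):
    """Delta-method standard error of the product a * b (an assumed formula)."""
    variance = (b ** 2) * se_a ** 2 + (a ** 2) * se_b ** 2 + 2.0 * a * b * cov_ab
    return math.sqrt(variance)


def two_sided_p_value(estimate, standard_error):
    """Two-sided normal p-value for z = estimate / standard_error."""
    z = estimate / standard_error
    # norm.sf(z) = 0.5 * erfc(z / sqrt(2)), so 2 * sf(|z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical estimates for the two parameters, with zero covariance.
se = product_standard_error(a=2.0, b=3.0, se_a=0.5, se_b=0.4)
p = two_sided_p_value(estimate=2.0 * 3.0, standard_error=se)
```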

We ranked each individual policy paired with each target variable (policy-target pair) by the magnitude of the meaningfulness metric and selected the top 10 for each target variable. To calculate the counterfactuals for untreated states, we retrained the model for each of the selected policy-target pairs and passed in the non-policy sparse PCA principal components, the time dummy variable, and the treatment dummy variable, now changed to 1 for every untreated state, to obtain the counterfactuals along with their confidence intervals.
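The ranking step described above can be sketched as follows, assuming the scores are held in a dictionary keyed by (policy, target) pairs. The function name, the data layout, and the policy labels are hypothetical.

```python
def top_pairs(scores, n=10):
    """Select the n highest-magnitude policy-target pairs for each target.

    scores -- dict mapping (policy, target) to a meaningfulness score.
    Returns a dict mapping each target to a list of (policy, score) tuples,
    ordered by absolute score, largest first.
    """
    by_target = {}
    for (policy, target), score in scores.items():
        by_target.setdefault(target, []).append((policy, score))
    return {
        target: sorted(items, key=lambda item: abs(item[1]), reverse=True)[:n]
        for target, items in by_target.items()
    }


# Hypothetical scores for three policies and two target variables.
scores = {
    ("policy_a", "funding_mean"): 4.0,
    ("policy_b", "funding_mean"): -9.0,
    ("policy_c", "funding_mean"): 1.5,
    ("policy_a", "ent_rate"): 0.2,
}
selected = top_pairs(scores, n=2)
```

Ranking by absolute value keeps strongly negative policies in the selection, which matters for the repeal scenario discussed later in this section.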


We compiled all the counterfactuals for untreated states and selected the top 3 or 4 counterfactual values by magnitude for every state/metro and every target variable for display on the website.  This counterfactual value becomes our estimated increase in funding, jobs, exits, or founders if an untreated state or metro were to implement such a policy.


Policies can generally have a predominantly positive or negative impact on funding, jobs, exits, or founders. For policies with a predominantly negative effect for treated states, we calculated the counterfactual values for the treated states instead. This counterfactual value becomes our estimated amount of funding, jobs, exits, or founders a treated state or metro would recover if the policy were repealed.
