An Exploratory Data Analysis of the Indian Startup Ecosystem (2018-2021)

Bright Eshun
Jan 21, 2023

Source: www.startupindia.gov.in

1. Introduction

This is one of the projects I have to complete in the Azubi Data Analytics Trainee program. Its purpose is for students to showcase the knowledge and skills acquired throughout the program.

The data for the analysis is a zipped archive of four CSV files containing Indian startup funding records from 2018 to 2021. The source of the data is unknown. This article walks through the project using the data analysis phases from the Google Data Analytics Course: Ask, Prepare, Process, Analyze, Share, and Act.

2. Ask

In this phase, we define the business objective, develop our hypothesis, raise questions to help accept or reject the developed hypothesis, and get some general insights into the data.

2.1 Business Task

The business task is to analyze the Indian startup funding data to give insights and answers to entrepreneurs seeking to venture into the Indian startup ecosystem by highlighting key metrics to consider before venturing.

2.2 Hypothesis

Null Hypothesis: Technology companies receive more funding than the rest of the sectors.
Alternate Hypothesis: Technology companies do not receive more funding than the rest of the sectors.

2.3 Assumptions

  • Amounts in the 2018 data will be interpreted as dollars if they do not begin with a currency symbol.
  • Missing values in the Stage column will be replaced with the value “Seed” if the company was funded the same year it was founded.
  • Identical company names will be treated as multiple fundings.
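
To make the first assumption concrete, here is a minimal sketch of how a raw amount string could be read under it. The helper name `to_usd` is hypothetical, and the 0.0146 rupee-to-dollar rate is simply the one used in the cleaning code later in this article, so treat both as assumptions rather than the exact project code.

```python
import re

# Rupee-to-dollar rate taken from the article's cleaning code (an assumption, not a live rate)
INR_TO_USD = 0.0146

def to_usd(raw):
    """Return the amount in dollars, or None when the text holds no number."""
    text = str(raw).replace(",", "").strip()
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None  # e.g. an undisclosed amount
    value = float(match.group())
    # Only an explicit rupee sign triggers conversion; anything else is read as dollars
    return value * INR_TO_USD if text.startswith("₹") else value

print(to_usd("₹40,000,000"))  # rupees, converted
print(to_usd("$1,500,000"))   # dollars, kept
print(to_usd("250000"))       # no symbol, assumed to be dollars
print(to_usd("Undisclosed"))  # no number at all
```

Keeping the rule in one small function makes the assumption easy to change if a different default currency turns out to be more appropriate.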

2.4 Questions

  1. How many Tech and Non Tech companies were funded?
  2. What was the funding trend over the years? How many companies were funded each year?
  3. Which ten cities had the most startups?
  4. Did companies receive multiple rounds of funding during the period?
  5. Which sectors had the most startups?
  6. Which ten investors funded the most distinct startups?
  7. Which year had the highest average funding?
  8. What was the total investment each year?
  9. What was the total funding by sector class (Tech, Non Tech, Unknown)?
  10. Among the highly funded companies, which were Tech companies?
  11. Which ten funding stages received the most funding? How much of it went to Tech companies?
  12. Which ten cities received the most funding? How much of it went to Tech companies?

2.5 Deliverables

  1. A hypothesis.
  2. Questions to help gain insights into the data.
  3. A summary of the Analysis.
  4. Visualizations to communicate findings.
  5. Recommendations based on the analysis.

3. Prepare Data

In this phase, we would normally collect and store the data for our analysis. In our case the data has already been provided, so we focus on determining its constraints.

3.1 Information on Data

  1. The source of data is unknown.
  2. The data contains 4 comma-separated files: startup_funding2018.csv, startup_funding2019.csv, startup_funding2020.csv and startup_funding2021.csv.
  3. Below are columns in the data:
  • Company Name — The name of the company.
  • Founded — The year the company was founded.
  • Sector — The sector the company operates in.
  • Stage — The funding stage of the company.
  • Location — The location of the company’s headquarters.
  • Amount — The amount of funding the company received.
  • Description — The ‘About of the company’.
  • Investor — The investor(s) who funded the company.
  • Founder — The founders of the company.

3.2 Limitations of Data

  1. According to economictimes.indiatimes.com, an economic survey revealed that the government of India recognized 41,061 start-ups between 2020 and 2021. The 2020 and 2021 files together hold only about 2,260 rows, so most start-ups do not appear in the data.
  2. The 2018 start-up data has two missing columns (founders and founded) compared to the 2019, 2020, and 2021 datasets.
  3. Because the source is unknown, I wouldn’t say the data is credible or reliable. This data would not be considered when making real-life recommendations for entrepreneurs.

3.3 Data Selection

All 4 csv files will be used for the analysis.

3.4 Tool

Python will be used for data cleaning, manipulation and visualization.

4. Process Data

In this phase, we will find and eliminate the errors, inaccuracies, and inconsistencies in the data that will get in the way of the analysis and results. Processing data means cleaning and transforming data to ensure it is relevant and completely free from errors.

These are the basic steps to follow.

  1. Explore and observe the data.
  2. Check for and treat missing values or null values.
  3. Check and transform data types.
  4. Merge all four datasets into one.

4.1 Loading packages

Let’s load the libraries that will be used for the project. The libraries were aliased to make them easy to work with.

# Data manipulation and cleaning
import pandas as pd # Data manipulation
import numpy as np # Data manipulation
import seaborn as sns # Data visualization
import matplotlib.pyplot as plt # Data visualization

4.2 Importing Datasets

We imported the selected files.

# load the datasets with pandas
df_2018 = pd.read_csv('startup_funding2018.csv')
df_2019 = pd.read_csv('startup_funding2019.csv')
df_2020 = pd.read_csv('startup_funding2020.csv')
df_2021 = pd.read_csv('startup_funding2021.csv')
print('Dataset loaded')

4.3 Data Cleaning and Manipulation

  1. Explore data.
  2. Check for null values.

Let’s preview the datasets, display their shape (the number of rows and columns), and check the data types.

# Check the datatypes
df_2018.info()
# display dataframe
df_2018.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB
The first 5 rows in the 2018 data.

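Let’s check for null values in the data. Note that df_2018_copy is the copy of the 2018 data frame created during the cleaning of the 2018 data, which is why the output already shows the Amount($), Year of Funding and Funding Status columns.

# check the null values in each column of the dataframe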
df_2018_copy.isnull().sum()

Company Name 0
Industry 0
Round/Series 0
Location 0
About Company 0
Amount($) 148
Year of Funding 0
Funding Status 0
dtype: int64

The above steps were repeated for the 2019, 2020 and 2021 data frames.
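
Rather than repeating the checks by hand, the four data frames could be summarized in one loop. This is only a sketch: the `summarize` helper is hypothetical and the two tiny frames below stand in for the real df_2018 to df_2021.

```python
import pandas as pd

def summarize(frames):
    """Build one row per data frame: its shape and total number of missing cells."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing_cells": int(df.isnull().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Toy stand-ins for the real yearly data frames
frames = {
    "2018": pd.DataFrame({"Company Name": ["A", "B"], "Amount": ["$100", None]}),
    "2019": pd.DataFrame({"Company Name": ["C"], "Founded": [2015.0], "Amount": ["$5"]}),
}
summary = summarize(frames)
print(summary)
```

With the real data, the dictionary would hold the four loaded data frames keyed by year.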

After previewing the data the following observations were made:

2018 Observations

  • The data frame has 6 columns and 526 rows.
  • The column names in 2018 differ from those in the 2019, 2020 and 2021 datasets.
  • The Amount column mixes Indian rupee and US dollar symbols.
  • The Industry and Location columns hold several comma-separated values per cell.

2019 Observations

  • Data frame has 9 columns and 89 rows.
  • The Founded column has a float data type.
  • The Amount column has object data type. It contains dollar sign(s) and commas.

2020 Observations

  • Data frame has 9 columns and 1054 rows.
  • The Amount column has object data type. It contains dollar sign(s) and commas.
  • There’s an erroneous column, ‘Unnamed: 9’.

2021 Observations

  • Data frame has 9 columns and 1208 rows.
  • The Amount column has object data type. It contains dollar sign(s) and commas.
  • The Founded column has a float data type.

Now that we have noticed some of the issues with our data, we can clean and transform the data.

Steps involved in cleaning the data:

  • I removed duplicate rows in each data frame.
  • I formatted all columns as strings except the amount columns, which were formatted as numeric.
  • I split the values in the Location and Industry columns of the 2018 data frame on the comma delimiter and kept the first value as the primary location or sector.
  • In the 2018 data frame, I used regular expressions to extract the numeric values from the Amount column and converted the amounts carrying a rupee sign into dollars.
  • I created a column called ‘Year of Funding’ in the data frames for 2018, 2019, 2020, and 2021 and gave it a value of 2018, 2019, 2020, and 2021, respectively.
  • I created another column called ‘Funding Status’ to identify whether a funding amount was disclosed or undisclosed.
  • In the 2018 data frame, I replaced the missing values in the Amount column with the mode using the SimpleImputer.
  • In the 2019, 2020 and 2021 data frames, ‘Undisclosed’ amounts were converted to NumPy null values and kept as nulls, while values that were missing in the original data were replaced with the mode using the SimpleImputer from scikit-learn.
  • In each data frame, missing values in the Funding Stage column were replaced with the value ‘Seed’ if the funding year was the same as the founded year.
  • I replaced misplaced and erroneous values in the respective rows.
  • I dropped the extra ‘Unnamed’ column in the 2020 data frame.
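
A few of the generic steps above (dropping duplicates, dropping the stray ‘Unnamed’ column and tidying text) can be sketched as one helper. The function name `basic_clean` and the toy data are hypothetical; the real project may have done these steps differently.

```python
import pandas as pd

def basic_clean(df):
    """Drop exact duplicate rows and 'Unnamed' columns, then trim whitespace in text cells."""
    out = df.drop_duplicates().copy()
    # Columns named like 'Unnamed: 9' are assumed to be artifacts of the CSV export
    unnamed = [col for col in out.columns if str(col).startswith("Unnamed")]
    out = out.drop(columns=unnamed)
    # Trim whitespace in string cells; leave numbers and missing values untouched
    for col in out.columns:
        out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out.reset_index(drop=True)

raw = pd.DataFrame({
    "Company Name": [" Acme ", " Acme ", "Beta"],
    "Stage": ["Seed", "Seed", None],
    "Unnamed: 9": [None, None, None],
})
cleaned = basic_clean(raw)
print(cleaned)
```

The same helper could then be applied to each yearly data frame before the year-specific fixes.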

After the data frames had been cleaned individually, they were merged. The 2019, 2020 and 2021 data frames were merged first, since their columns had the same names. The columns in the 2018 data frame were then renamed to match before it was added to the combined data.

I performed additional data cleaning after combining the data frames.

  • The Funding Stage column in the combined data was cleaned to achieve 30 distinct stages using regular expressions.
  • I created a new column called ‘New Sector’, a cleaned-up version of the ‘Sector’ column.
  • I created a new column called ‘Tech or Non-Tech’ to classify the values in the sector as Tech or Non-Tech.
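
The article does not show the rule used to label a sector as Tech or Non-Tech, so the following is only an illustration of one possible approach: a keyword pattern applied to the sector name. The pattern and the `sector_class` helper are hypothetical.

```python
import re

# Hypothetical keyword pattern; the list actually used in the project is not shown
TECH_PATTERN = re.compile(
    r"tech|software|saas|e-commerce|\bai\b|\bit\b|artificial intelligence",
    re.IGNORECASE,
)

def sector_class(sector):
    """Label a sector string as 'Tech', 'Non Tech' or 'Unknown'."""
    if not isinstance(sector, str) or not sector.strip():
        return "Unknown"  # missing or empty sector
    return "Tech" if TECH_PATTERN.search(sector) else "Non Tech"

for name in ["Fintech", "IT", "Food and Nutrition", None]:
    print(name, "->", sector_class(name))
```

On a data frame this could be applied with `combined_set['New Sector'].map(sector_class)`. A keyword rule like this is easy to audit, but borderline sectors still need a manual decision.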

Cleaning 2018 data frame

Let’s clean the amount column in 2018.

# remove commas and extract the digits into a new numeric column
df_2018['Amount($)'] = df_2018['Amount'].str.replace(',', '').str.extract(r'(\d+)', expand=False).astype(float)

# Let's assume an amount with no currency sign is already in dollars
# and convert the rows that start with the rupee sign into dollars
df_2018.reset_index(drop=True, inplace=True)
is_rupee = df_2018['Amount'].str.startswith('₹', na=False)
df_2018.loc[is_rupee, 'Amount($)'] = df_2018.loc[is_rupee, 'Amount($)'] * 0.0146

# create a copy of the 2018 dataframe
df_2018_copy = df_2018.copy()
df_2018_copy.head()
Displaying the head of the 2018 data after cleaning the Amount column.

I will later drop the Amount column.

Let’s create new columns ‘Year of Funding’ and ‘Funding Status’.

# drop Amount column
df_2018_copy.drop(columns=['Amount'], inplace=True)

# Create Year of Funding column (stored as a number so it can be used in arithmetic later)
df_2018_copy['Year of Funding'] = 2018

# Create Funding Status column
df_2018_copy['Funding Status'] = 'Disclosed'

Let’s impute the missing values in the Amount($) column with the mode.

# fill null values in the Amount($) column with the mode (most frequent value)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
imputer_18 = imputer.fit(df_2018_copy[['Amount($)']])
# transform returns a 2D array, so flatten it before assigning it to the column
df_2018_copy['Amount($)'] = imputer_18.transform(df_2018_copy[['Amount($)']]).ravel()

Next, let’s confirm that no missing values remain.

# confirm null values have been replaced
df_2018_copy['Amount($)'].isnull().sum()

0
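
For a single column, the same most-frequent-value imputation can be reproduced with pandas alone, which makes it easier to see what is being filled in. This is a sketch on made-up numbers, not the project data.

```python
import numpy as np
import pandas as pd

# Toy amounts with two missing values
amounts = pd.Series([1_000_000, np.nan, 1_000_000, 250_000, np.nan], name="Amount($)")

# mode() ignores NaN and returns sorted values, so [0] is the smallest of any tied modes
most_frequent = amounts.mode()[0]
filled = amounts.fillna(most_frequent)

print(most_frequent)
print(filled.isnull().sum())
```

Because every missing amount receives the same value, mode imputation concentrates the distribution around that value, which is worth keeping in mind when reading the averages later on.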

Cleaning 2019 data frame

Note: the 2019, 2020 and 2021 data frames had similar issues, so most of the cleaning methods applied to 2019 were also applied to 2020 and 2021.

Let’s create a function to clean the Amount($) column in the 2019, 2020 and 2021 data frames.

def clean_create_cols(data, fund_year):
    # treat amounts that end in a bare dollar sign (for example a lone '$') as missing
    data['Amount($)'] = data['Amount($)'].replace({r"(\$$)": np.nan}, regex=True)
    # create a funding status column: amounts starting with '$' or a digit become 'Disclosed'
    data['Funding Status'] = data['Amount($)'].replace({r"(^[$\d].+)": "Disclosed"}, regex=True)
    # turn 'Undisclosed' variants and the string 'nan' into null values
    data['Amount($)'] = data['Amount($)'].replace({r"(\$U.+|\$u.+|U.+)": np.nan, "nan": np.nan}, regex=True)
    # strip every non-digit character and convert the amount to a number
    data['Amount($)'] = data['Amount($)'].replace({r"(\D)": ""}, regex=True).astype(float)
    # add a column to represent the year of funding
    data['Year of Funding'] = fund_year
    return data

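In rows where the founding year is the same as the funding year, missing values in the Stage column are replaced with ‘Seed’.

# rows funded in the same year they were founded
# (this filter is not shown in the original code, so its definition here is an assumption)
founded_2019 = df_2019_copy[df_2019_copy['Founded'] == df_2019_copy['Year of Funding']]
# get indexes of those rows with an empty funding stage
stage_null_2019 = founded_2019[founded_2019['Stage'].isnull()].index.tolist()
# fill the null stage values with 'Seed' (.loc accepts a list of labels, .at does not)
df_2019_copy.loc[stage_null_2019, 'Stage'] = 'Seed'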

Cleaning Combined Dataset

Let’s combine the datasets.

# Joining all DataFrames with similar column names
combined_2019_2021 = pd.concat([df_2019_copy, df_2020_copy, df_2021_copy], ignore_index = True)
combined_2019_2021.columns = ["Company Name", "Year Founded", "Headquarters", "Sector", "Description", "Founders", "Investors", "Amount", "Funding Stage", "Funding Status", "Funding Year"]


# Renaming the columns in the 2018 dataframe to match with the other dataframes
df_2018_copy.columns = ['Company Name', 'Sector', 'Funding Stage', 'Headquarters', 'Description', 'Amount', 'Funding Year', 'Funding Status']


# Joining the 2018 DataFrame to the 2019-2021 DataFrame
combined_set = pd.concat([df_2018_copy, combined_2019_2021], ignore_index = True)
combined_set.head()
First 5 rows of the combined dataset.

Let’s clean the Funding Stage column.

Corrections for Funding Stage column.

# apply the corrections (a dict mapping regex patterns to stage names) to the Funding Stage column
combined_set['Funding Stage'] = combined_set['Funding Stage'].str.capitalize().replace(Stage_corrections, regex=True)

# get the number of unique stages
Stages = combined_set['Funding Stage'].sort_values().unique()
print(len(Stages))
Stages

30
array(['Angel', 'Bridge', 'Corporate round', 'Debt Financing', 'Edge',
'Fresh Funding', 'Grant', 'Mid series', 'Non-equity assistance',
'Post-ipo debt', 'Post-ipo equity', 'Pre Seed', 'Pre Series A',
'Pre Series B', 'Pre Series C', 'Pre-series', 'Private equity',
'Secondary market', 'Seed', 'Series A', 'Series B', 'Series C',
'Series D', 'Series E', 'Series F', 'Series G', 'Series H',
'Series i', 'Undisclosed', 'Unknown'], dtype=object)
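
The Stage_corrections mapping itself is not reproduced in this article. As a rough idea of its shape, here is a small hypothetical mapping from regular expressions to canonical stage names, applied the same way as above; the patterns are illustrative only.

```python
import pandas as pd

# Hypothetical corrections: keys are regular expressions, values are the canonical label.
# They are written against capitalized text because str.capitalize() runs first.
stage_corrections = {
    r"^Pre[- ]?seed.*": "Pre Seed",
    r"^Seed.*": "Seed",
    r"^Angel.*": "Angel",
    r"^Series a\d*$": "Series A",
    r"^Debt.*": "Debt Financing",
}

stages = pd.Series(["seed round", "PRE-SEED", "Series A1", "angel round", "Debt"])
cleaned = stages.str.capitalize().replace(stage_corrections, regex=True)
print(cleaned.tolist())
```

Anchoring each pattern with `^` keeps a rule such as the one for ‘Seed’ from also rewriting ‘Pre Seed’.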

Additional cleaning was done.

4.4 Feature Engineering

I realized the only numerical column in the combined data was the Amount column, so I created two more numerical columns, ‘Company Age’ and ‘Age at Funding’. These will help in case we want to find correlations between the features. ‘Company Age’ is how old the company is as of 2022, and ‘Age at Funding’ is how old the company was at the time it received funding.

# Create new features/columns 'Company Age' and 'Age at Funding'
combined_set['Company Age'] = 2022 - combined_set['Year Founded']
combined_set['Age at Funding'] = combined_set['Funding Year'] - combined_set['Year Founded']
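
The subtraction only works when both year columns are numeric. The sketch below uses a made-up three-row frame to show one way to coerce them first; rows with a missing founding year simply end up with a missing age.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["Acme", "Beta", "Gamma"],
    "Year Founded": ["2015", 2019.0, None],   # mixed types, as can happen after merging
    "Funding Year": ["2018", "2019", "2021"],
})

# Coerce both year columns to numbers; values that cannot be parsed become NaN
for col in ["Year Founded", "Funding Year"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

REFERENCE_YEAR = 2022  # the "current year" used in this article
toy["Company Age"] = REFERENCE_YEAR - toy["Year Founded"]
toy["Age at Funding"] = toy["Funding Year"] - toy["Year Founded"]
print(toy)
```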

5. Analyze

In this phase we find the insights and relationships between the features/columns of the combined data. We can start by looking at the statistical summary.

We can check the following:

  • mean
  • 25%, 50%, 75% percentiles
  • count
  • standard deviation and
  • min and max values in our numerical columns
The statistical summary of the data frame.

From the summary above, the minimum amount received by a company was $876 and the maximum amount was $150,000,000,000. The mean was $114,274,992.7, while the 25%, 50% and 75% percentiles in the amount column were $1,000,000, $2,500,000 and $10,000,000 respectively, so the mean is pulled far above the median by a few very large amounts. Using 2022 as the current year, the oldest company in the data had been around for 59 years and received funding as a startup when it was 58 years old.

The rest of the analysis is presented in the Share phase, where visualizations answer the questions from the Ask phase.

6. Share

In this phase we will allow the data to tell its own story using visualizations. Visualizations will be used to communicate the findings and answers to the questions raised about the data.

6.1 Communicating Findings

Univariate Analysis

How many Tech and Non-Tech companies were funded?

# count the values in the 'Tech or Non Tech' column
sector_class = combined_no_duplicates['Tech or Non-Tech'].value_counts()
sector_class

Tech 1176
Non Tech 994
Unknown 44
Name: Tech or Non-Tech, dtype: int64


# plot a bar chart representing the count of each sector class
plt.figure(figsize=(8, 5))
sns.countplot(data=combined_set, x='Tech or Non-Tech')
plt.title('Count of Sector Class')
plt.xlabel('Sector Class')
plt.show()
A bar chart displaying the number of startups in each sector class.
  • From 2018 to 2021, 1,176 Tech companies were funded.
  • 994 Non Tech companies were funded as well. The remaining 44 companies had the ‘Unknown’ sector class.

What was the trend of funding over the years? How many companies were funded each year?

# plot a line graph to show the trend of startups within the period 2018–2021
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
funding_year_count = combined_set.groupby(['Funding Year'])['Company Name'].count()
funding_year_count.plot()
plt.title('Trend of Startups over the period')


# plot a bar chart to show the number of startups within each year
# base_color is not defined in this article; the first seaborn palette color is assumed here
base_color = sns.color_palette()[0]
plt.subplot(1, 2, 2)
sns.countplot(
    x='Funding Year',
    data=combined_set,
    color=base_color)
plt.title('Number of Companies Funded In The Year 2018–2021')
plt.show()
Charts showing the trend of startups over the years (2018 to 2021).
  • The number of funded startups increased every year except 2019, when it dropped from more than 500 in 2018 to fewer than 100. The 2019 file is much smaller than the others, so the drop may reflect how the data was recorded. If the trend since 2019 continues, we can expect more funded startups in the years after 2021.

What are the Top Ten Cities with Most Startups?

# count the number of startups in each city
top_ten_HQ = combined_set['Headquarters'].value_counts().head(10).sort_values()
top_ten_HQ

Hyderabad 76
Gurgaon 80
Noida 86
Delhi 88
Pune 104
Chennai 106
New Delhi 230
Gurugram 238
Mumbai 468
Bangalore 859
Name: Headquarters, dtype: int64


# plot a horizontal bar chart to show the number of startups in each city
plt.figure(figsize=(12, 5))
top_ten_HQ.plot(kind='barh')
plt.title('Top Ten Cities with Most Startups')
plt.xlabel('Count')
plt.ylabel('Cities')
A horizontal bar chart representing the number of startups in the top cities.
  • The cities that headquartered the most startups from 2018 to 2021 were, in order, Bangalore, Mumbai, Gurugram, New Delhi, Chennai, Pune, Delhi, Noida, Gurgaon and Hyderabad, with 859, 468, 238, 230, 106, 104, 88, 86, 80 and 76 startups respectively.
  • Bangalore has almost twice as many startups as Mumbai, India’s largest city. The concentration of funded startups in these cities suggests that being headquartered in one of them may improve the chances of being funded.
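
One caveat about the list above: Gurgaon and Gurugram are two names for the same city (Gurgaon was officially renamed Gurugram), so their counts arguably belong together. The sketch below merges such aliases using the counts printed earlier. Folding Delhi into New Delhi is an additional assumption on my part, and the `CITY_ALIASES` mapping is hypothetical.

```python
import pandas as pd

# Gurgaon is the former name of Gurugram; treating Delhi as New Delhi is an assumption
CITY_ALIASES = {"Gurgaon": "Gurugram", "Delhi": "New Delhi"}

# Counts copied from the output above
top_ten_hq = pd.Series({
    "Bangalore": 859, "Mumbai": 468, "Gurugram": 238, "New Delhi": 230,
    "Chennai": 106, "Pune": 104, "Delhi": 88, "Noida": 86,
    "Gurgaon": 80, "Hyderabad": 76,
})

# Rename the aliases, then add up the rows that now share a label
merged = (
    top_ten_hq.rename(index=CITY_ALIASES)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)
print(merged)
```

In the full analysis the aliases would be applied to the Headquarters column before counting, so that other cities can move into the top ten.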

Did companies receive multiple fundings throughout the time period?

# get the fraction of the number of company that received funding once, twice and three times or more
total = len(number_of_fundings)
number_of_fundings['Number of Fundings'] = number_of_fundings['Number of Fundings'].astype(int)
one_funding = len(number_of_fundings[number_of_fundings['Number of Fundings'] == 1]) / total
two_funding = len(number_of_fundings[number_of_fundings['Number of Fundings'] == 2]) / total
multiple_funding = len(number_of_fundings[number_of_fundings['Number of Fundings'] >= 3]) / total
print(one_funding +two_funding+ multiple_funding)

# plot a bar chart to show the fractions or percentages for fundings
plt.figure(figsize=(8, 5))
locations = [1, 2, 3]
heights = [one_funding, two_funding, multiple_funding]
labels = ['Once Funded', 'Twice Funded', 'Multiple Funded']
plt.bar(locations, heights, tick_label=labels)
plt.title('Funding Frequency')
plt.xlabel('Number of times funded')
plt.ylabel('Fraction')
A bar chart representing the frequency at which companies were funded.
  • Over the four years, only 6% of funded startups were funded more than twice and around 13% were funded twice; the rest were funded once. Within this period, being funded more than twice was uncommon.
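
The number_of_fundings data frame used above is built elsewhere in the notebook. Under the assumption that identical company names are separate fundings of the same company, it could be derived roughly as follows; the toy names are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["Acme", "Acme", "Beta", "Gamma", "Gamma", "Gamma", "Delta"],
})

# One row per company with the number of times its name appears
number_of_fundings = (
    toy["Company Name"].value_counts()
    .rename_axis("Company Name")
    .reset_index(name="Number of Fundings")
)

# Share of companies funded once, twice, and three or more times
shares = (
    number_of_fundings["Number of Fundings"]
    .clip(upper=3)                  # group 3 or more fundings together
    .value_counts(normalize=True)
    .sort_index()
)
print(shares)
```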

Which sector had most startups?

# check the top 10 sectors with the most startups
Top_ten_sectors = combined_set['New Sector'].value_counts().head(10)
Top_ten_sectors

Fintech 276
Edtech 261
Healthcare and Wellness 169
Financial Services 168
E-commerce 128
Food and Nutrition 103
IT 81
Automotive 80
Artificial Inteligence 76
Technology 67
Name: New Sector, dtype: int64


# plot a bar chart to show the top 10 sectors with the most number of startups
plt.figure(figsize=(8, 5))
Top_ten_sectors.sort_values().plot(kind='barh')
plt.title('Top 10 Sectors with the Most Startups')
plt.xlabel('Number of Startups')
plt.ylabel('Sector')
A horizontal bar chart representing the number of startups in the top sectors.
  • The top 10 sectors with the most startups were, from first to tenth: Fintech, Edtech, Healthcare and Wellness, Financial Services, E-commerce, Food and Nutrition, IT, Automotive, Artificial Intelligence and Technology, with 276, 261, 169, 168, 128, 103, 81, 80, 76 and 67 startups respectively.
  • Entrepreneurs may improve their chances of getting funded by venturing into these sectors, 7 of which are Tech sectors.

Which top 10 investors funded the most different startups?

Top_10_investors_ = combined_no_duplicates['Investors'].value_counts().head(10)
Top_10_investors_

Inflection Point Ventures 28
Venture Catalysts 25
Mumbai Angels Network 16
Angel investors 14
Titan Capital 11
Undisclosed 10
Unicorn India Ventures 10
Sequoia Capital India 7
Better Capital 7
Elevation Capital 6
Name: Investors, dtype: int64


plt.figure(figsize=(8, 5))
Top_10_investors_.sort_values().plot(kind='barh')
plt.title('Top 10 Investors who funded different startups')
plt.xlabel('Number of Startups')
plt.ylabel('Investor');
A bar chart representing the number of different companies each investor funded.
  • Together these investors funded more than 120 startups. Inflection Point Ventures and Venture Catalysts funded 28 and 25 different companies respectively. A later step could examine which sectors or sector classes these investors were most involved in.
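
Counting the raw Investors strings treats a cell that lists several investors as a single investor. If the cells are comma-separated (an assumption about the data layout), each investor can be counted separately by splitting and exploding the column, as in this toy sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["Acme", "Beta", "Beta", "Gamma"],
    "Investors": ["Titan Capital, Better Capital", "Titan Capital", "Titan Capital", None],
})

# One row per (company, investor) pair, assuming a comma separates investor names
per_investor = (
    toy.dropna(subset=["Investors"])
    .assign(Investor=lambda d: d["Investors"].str.split(","))
    .explode("Investor", ignore_index=True)
)
per_investor["Investor"] = per_investor["Investor"].str.strip()

# Number of different companies each investor funded
distinct_companies = (
    per_investor.groupby("Investor")["Company Name"]
    .nunique()
    .sort_values(ascending=False)
)
print(distinct_companies)
```

Using `nunique` counts each company once per investor, even when the same investor appears in several funding rounds of that company.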

Multivariate Analysis

What is the highest average funding yearly?

# get the average(mean) funding yearly
average_funding_year= combined_set.groupby(['Funding Year']).agg({'Amount': 'mean'})
average_funding_year.reset_index(inplace=True)
average_funding_year

Funding Year Amount
0 2018 12932425.1
1 2019 43330301.3
2 2020 112950185.8
3 2021 171218804.6

# plot a bar chart to show the average funding yearly
plt.figure(figsize=(22, 10))
plt.subplot(1, 2, 1)
sns.barplot(
data=average_funding_year,
x='Funding Year',
y='Amount',
color=base_color)
plt.title('Average Funding Per Year')

# plot a box plot to show the distribution of funding in each year
plt.subplot(1, 2, 2)
sns.boxplot(data=combined_set, y='Amount', x='Funding Year')
plt.title('Funding vs Year')
# limit the y-axis so the boxes stay readable despite very large outliers
plt.ylim(-10, 80000000)
Charts showing the distribution of funding amounts in the various years.
  • Looking at the bar chart, the mean funding has increased every year: from around $13M in 2018 to $43M in 2019, $113M in 2020 and $171M in 2021.
  • From the boxplot, 2019 has the highest median funding of all the years, likely because so few startups were recorded for 2019. Many of the larger amounts in 2018, 2020 and 2021 appear as outliers. Considering the current trend, we can anticipate that the median funding in subsequent years will stay high or increase.
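
The bar chart and the boxplot tell different stories because the mean reacts strongly to a few very large deals while the median does not. A toy example with made-up amounts shows the effect when both are computed side by side.

```python
import pandas as pd

toy = pd.DataFrame({
    "Funding Year": [2020, 2020, 2020, 2021, 2021, 2021],
    "Amount": [1_000_000, 2_000_000, 3_000_000, 1_000_000, 2_000_000, 300_000_000],
})

# Mean and median funding per year in one table
yearly = toy.groupby("Funding Year")["Amount"].agg(["mean", "median"])
print(yearly)
```

Here a single large deal in 2021 lifts the mean far above the median, while the median is the same in both years. Reporting both statistics for the real data would make the yearly comparison easier to interpret.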

What is the sum of investments yearly?

# get the sum of fundings per year
sum_funding_year= combined_set.groupby(['Funding Year']).agg({'Amount': 'sum'})
sum_funding_year.reset_index(inplace=True)
sum_funding_year

Funding Year Amount
0 2018 6789523177.0
1 2019 3336433200.0
2 2020 90924899604.0
3 2021 179608526000.0

# plot a bar chart to show the sum of fundings in each year
plt.figure(figsize=(8, 5))
sns.barplot(
data=sum_funding_year,
x='Funding Year',
y='Amount',
color=base_color)
plt.title('Sum of Funding Per Year')
plt.show()
A bar chart representing the sum of fundings in each year.
  • The year with the highest number of funded startups, 2021, also had the highest total funding, at about $180B.

What is the sum of fundings by sector class (Tech, Non Tech, Unknown)?

# group the dataframe by sector class and get the sum
sum_sector_class = combined_set.groupby(['Tech or Non-Tech'])['Amount'].sum()
sum_sector_class

Tech or Non-Tech
Non Tech 98776168099.0
Tech 181567569962.0
Unknown 315643920.0
Name: Amount, dtype: float64

# plot a bar chart to show the sum of the fundings in each sector class
plt.figure(figsize=(8, 5))
sum_sector_class.plot(kind='bar')
plt.title('Sum Fundings by Sector class', pad=40)
plt.ylabel('Sum')
plt.xlabel('Sector Class')
plt.show()
A bar chart representing the sum of investment by sector class (Tech, Non Tech, Unknown).
  • Tech companies received more funding in total than Non Tech companies: about $181.6B for Tech against about $98.8B for Non Tech. The bar for ‘Unknown’ is barely visible because its total, about $0.3B, is very small by comparison.

Among the highly funded companies which of them were Tech companies?

# get the median amount in the dataframe
median = combined_set['Amount'].median()
# keep the companies that received funding greater than the median value
highly_funded = combined_no_duplicates.query('Amount > {}'.format(median))
# count the number of companies in each sector class
highly_funded_companies = highly_funded['Tech or Non-Tech'].value_counts()
highly_funded_companies

Tech 440
Non Tech 347
Unknown 12
Name: Tech or Non-Tech, dtype: int64

# plot a bar chart to show the number of highly funded companies in each sector class
plt.figure(figsize=(8, 5))
highly_funded_companies.plot(kind='bar')
plt.title('Number of highly funded companies', pad=40)
plt.ylabel('Number of startups')
plt.xlabel('Sector Class')
plt.show()
A bar chart displaying the number of highly funded startups by sector class.
  • From the chart, the Tech sector class has the highest number of highly funded startups: 440, compared with 347 for Non Tech.

Which top 10 Cities received most fundings? How much of it was used to fund Tech companies?

# Used a stacked barplot to visualize the data
stacked_cities_df_.plot(kind='barh', stacked=True, color=['red', 'skyblue', 'green'], figsize=(20, 8))
plt.title('Investments by cities according to sector class', pad=40)
plt.ylabel('Headquarters')
plt.xlabel('Amount')
plt.xlim(0, 2.35e11)
plt.show()
A stacked bar chart showing the amount of funding invested in startups in the top cities by sector class.
  • From the graph, Mumbai startups received more than $230 billion in funding, almost two-thirds of which went to Tech companies. Startups headquartered in California had all of their funding go to Tech companies. In Bangalore, Chennai, Delhi, New Delhi and Pune, most of the funding also went to Tech companies.
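
The stacked_cities_df_ table behind this chart is prepared elsewhere in the notebook. One way such a table could be built is with a pivot table of summed amounts per city and sector class, sketched here on made-up numbers; the column names follow the combined data, everything else is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "Headquarters": ["Mumbai", "Mumbai", "Bangalore", "Bangalore", "Pune"],
    "Tech or Non-Tech": ["Tech", "Non Tech", "Tech", "Tech", "Non Tech"],
    "Amount": [200.0, 100.0, 50.0, 25.0, 10.0],
})

# Rows are cities, columns are sector classes, cells are total funding
stacked = toy.pivot_table(
    index="Headquarters",
    columns="Tech or Non-Tech",
    values="Amount",
    aggfunc="sum",
    fill_value=0,
)

# Keep only the cities with the largest overall totals (top 2 in this toy example)
top = stacked.loc[stacked.sum(axis=1).nlargest(2).index]
print(top)
# top.plot(kind='barh', stacked=True) would then draw the stacked bars
```

The same pattern, with 'Funding Stage' as the index, would give the table for the funding-stage chart.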

Which top 10 Funding Stages received most fundings? How much of it was used to fund Tech companies?

# Used a stacked barplot to visualize the data
stacked_stages_df_.plot(kind='barh', stacked=True, color=['red', 'skyblue', 'green'], figsize=(20, 8))
plt.title('Investments by Funding stages according to sector class', pad=40)
plt.ylabel('Funding Stage')
plt.xlabel('Amount')
# plt.xlim(0, 0.41e11)
plt.show()
A stacked bar chart showing the amount of funding invested in startups in the top funding stages by sector class.
  • From the graph, 9 of the funding stages had some of their funding go to Tech companies. More than 99% of the Debt Financing amount went to Tech companies. Debt financing means raising money through loans rather than by selling equity, so this shows that lenders were comfortable extending credit to Tech companies. In Series A, B, C, D and F, slightly more than 50% of the amount went to Tech companies.

7. Act

In this phase we give recommendations to our stakeholders, explain the meaning of the findings so far and show how they can be used to make data-driven decisions.

Hypothesis: Based on the questions answered so far, the data supports our null hypothesis. Companies in the tech industry are funded more often and also receive more funding than the rest (Non Tech startups).

7.1 Recommendations

1. Entrepreneurs who are considering starting a company should consider venturing into Tech. Most of the companies that were funded were Tech companies, and they also received the larger investments.

2. Entrepreneurs should consider headquartering their companies in Mumbai, Bangalore and Gurugram. These cities are in the top three cities by both number of startups and by the sum of fundings generated.

3. During the Pre Seed and Seed stages of the startup journey, entrepreneurs should consider seeking funding from family and friends, since it gives them a certain flexibility. Loans from family and friends may require little or no security compared with banks, and may be interest-free or at a low rate. One company in the data received as little as $876, an amount that family or friends could plausibly lend without interest.

4. Entrepreneurs venturing into startups can consider the following sectors: Fintech, Edtech, Healthcare and Wellness, Financial Services and E-commerce. Venturing into these sectors might increase their chances of getting funded, though not necessarily with a large amount.

The datasets and the complete code can be found on my GitHub. There may be more visualizations and questions there than in this article, since I will still be working on the project.

Acknowledgements

I would like to acknowledge the entire Azubi Data Analytics training team, Marvin Lomo, Richard Kadey, Racheal Appiah-kubi and Emmanuel KOUPOH, for their immense support and guidance throughout the data analytics and Python journey. They have been wonderful and are always pushing us beyond our limits.

I would also like to acknowledge Katie Huang Xiemin. The template for this article was adapted from her article on her Google Data Analytics Capstone Project. Here is a link to her project.

I would love to receive comments, suggestions and corrections on my work. Thank you.
