STUDY ON UBER DATASET¶

Objectives of the study¶

To make a basic EDA on the data
To study and visualise different parameters with different hypothesis testing.
To find the relationship between different parameters

%matplotlib inline

import datetime as dt
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')

var = pd.read_csv('/content/Uber project dataset.csv')

var.head(10)

var.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1155 entries, 0 to 1154
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Start_date      1155 non-null   object 
 1   Start_day       1155 non-null   object 
 2   Start_time      1155 non-null   object 
 3   SHour           1155 non-null   int64  
 4   End_date        1155 non-null   object 
 5   End_day         1155 non-null   object 
 6   End_time        1155 non-null   object 
 7   Duration        1155 non-null   object 
 8   Duration_min    1155 non-null   int64  
 9   Category        1155 non-null   object 
 10  Pickup          1155 non-null   object 
 11  Destination     1155 non-null   object 
 12  KM              1155 non-null   float64
 13  Purpose         1141 non-null   object 
 14  Driver_Rating   1155 non-null   int64  
 15  Fare            1155 non-null   float64
 16  Tip             1155 non-null   int64  
 17  Payment_Type    1155 non-null   object 
 18  Payment_Method  1155 non-null   object 
 19  Status          1155 non-null   object 
dtypes: float64(2), int64(4), object(14)
memory usage: 180.6+ KB

print("Prints all the columns that are of numerical type")
var.select_dtypes(include=['number'])

Prints all the columns that are of numerical type

print("Prints all the columns that are of non-numerical type")
var.select_dtypes(include=['object'])

Prints all the columns that are of non-numerical type

Extracting date parameters - Date, Day, Month, Year from the date column

var['Start_date']=pd.to_datetime(var['Start_date'])

var['Date'] = var['Start_date'].dt.day
var['Month'] = var['Start_date'].dt.month
var['Year'] = var['Start_date'].dt.year
var['Weekday']=var['Start_date'].dt.day_name()

print(var)

     Start_date Start_day Start_time  SHour   End_date   End_day  End_time  \
0    2016-01-01    Friday   22:54:00     22  01-Jan-16    Friday  01:39:00   
1    2016-01-01    Friday   13:42:00     13  01-Jan-16    Friday  13:54:00   
2    2016-01-01    Friday   22:46:00     22  01-Jan-16    Friday  22:58:00   
3    2016-01-03    Sunday   12:57:00     12  03-Jan-16    Sunday  13:08:00   
4    2016-01-03    Sunday   20:24:00     20  03-Jan-16    Sunday  20:40:00   
...         ...       ...        ...    ...        ...       ...       ...   
1150 2016-12-30    Friday   20:48:00     20  30-Dec-16    Friday  20:57:00   
1151 2016-12-31  Saturday   16:21:00     16  31-Dec-16  Saturday  16:33:00   
1152 2016-12-31  Saturday   13:22:00     13  31-Dec-16  Saturday  13:41:00   
1153 2016-12-31  Saturday   19:41:00     19  31-Dec-16  Saturday  19:50:00   
1154 2016-12-31  Saturday   19:30:00     19  31-Dec-16  Saturday  19:39:00   

      Duration  Duration_min  Category  ... Driver_Rating    Fare  Tip  \
0     02:45:00           165  Business  ...             0  6663.0    0   
1     00:12:00            12  Personal  ...             4   175.2   17   
2     00:12:00            12  Personal  ...             0   132.3   13   
3     00:11:00            11  Business  ...             3   171.9   17   
4     00:16:00            16  Business  ...             3   132.3   13   
...        ...           ...       ...  ...           ...     ...  ...   
1150  00:09:00             9  Business  ...             4   129.0   12   
1151  00:12:00            12  Business  ...             1   218.1   21   
1152  00:19:00            19  Business  ...             5   373.2   37   
1153  00:09:00             9  Business  ...             0   102.6   10   
1154  00:09:00             9  Business  ...             5   125.7   12   

     Payment_Type  Payment_Method     Status  Date Month  Year   Weekday  
0            Card            Card  Completed     1     1  2016    Friday  
1         Digital       Uber Cash  Completed     1     1  2016    Friday  
2         Digital       Gift Card  Completed     1     1  2016    Friday  
3            Card            Card  Completed     3     1  2016    Sunday  
4         Digital       Uber Cash  Completed     3     1  2016    Sunday  
...           ...             ...        ...   ...   ...   ...       ...  
1150      Digital       Uber Cash  Completed    30    12  2016    Friday  
1151      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1152      Digital    UPI/e-Wallet  Completed    31    12  2016  Saturday  
1153      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1154         Card            Card  Completed    31    12  2016  Saturday  

[1155 rows x 24 columns]

[var.isna().any(axis=1)]

[0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1150    False
 1151    False
 1152    False
 1153    False
 1154    False
 Length: 1155, dtype: bool]

var[var.isna().any(axis=1)]

var.dropna(axis=0, how="any", inplace=True)
var.shape

(1141, 24)

Creating a pandas dataframe

df=pd.DataFrame(var)
print(df)

     Start_date Start_day Start_time  SHour   End_date   End_day  End_time  \
0    2016-01-01    Friday   22:54:00     22  01-Jan-16    Friday  01:39:00   
1    2016-01-01    Friday   13:42:00     13  01-Jan-16    Friday  13:54:00   
2    2016-01-01    Friday   22:46:00     22  01-Jan-16    Friday  22:58:00   
3    2016-01-03    Sunday   12:57:00     12  03-Jan-16    Sunday  13:08:00   
4    2016-01-03    Sunday   20:24:00     20  03-Jan-16    Sunday  20:40:00   
...         ...       ...        ...    ...        ...       ...       ...   
1150 2016-12-30    Friday   20:48:00     20  30-Dec-16    Friday  20:57:00   
1151 2016-12-31  Saturday   16:21:00     16  31-Dec-16  Saturday  16:33:00   
1152 2016-12-31  Saturday   13:22:00     13  31-Dec-16  Saturday  13:41:00   
1153 2016-12-31  Saturday   19:41:00     19  31-Dec-16  Saturday  19:50:00   
1154 2016-12-31  Saturday   19:30:00     19  31-Dec-16  Saturday  19:39:00   

      Duration  Duration_min  Category  ... Driver_Rating    Fare  Tip  \
0     02:45:00           165  Business  ...             0  6663.0    0   
1     00:12:00            12  Personal  ...             4   175.2   17   
2     00:12:00            12  Personal  ...             0   132.3   13   
3     00:11:00            11  Business  ...             3   171.9   17   
4     00:16:00            16  Business  ...             3   132.3   13   
...        ...           ...       ...  ...           ...     ...  ...   
1150  00:09:00             9  Business  ...             4   129.0   12   
1151  00:12:00            12  Business  ...             1   218.1   21   
1152  00:19:00            19  Business  ...             5   373.2   37   
1153  00:09:00             9  Business  ...             0   102.6   10   
1154  00:09:00             9  Business  ...             5   125.7   12   

     Payment_Type  Payment_Method     Status  Date Month  Year   Weekday  
0            Card            Card  Completed     1     1  2016    Friday  
1         Digital       Uber Cash  Completed     1     1  2016    Friday  
2         Digital       Gift Card  Completed     1     1  2016    Friday  
3            Card            Card  Completed     3     1  2016    Sunday  
4         Digital       Uber Cash  Completed     3     1  2016    Sunday  
...           ...             ...        ...   ...   ...   ...       ...  
1150      Digital       Uber Cash  Completed    30    12  2016    Friday  
1151      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1152      Digital    UPI/e-Wallet  Completed    31    12  2016  Saturday  
1153      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1154         Card            Card  Completed    31    12  2016  Saturday  

[1141 rows x 24 columns]

df.nunique()

Start_date        352
Start_day           7
Start_time        685
SHour              23
End_date          352
End_day             7
End_time          675
Duration           99
Duration_min       99
Category            2
Pickup            175
Destination       183
KM                256
Purpose            10
Driver_Rating       6
Fare              256
Tip                86
Payment_Type        3
Payment_Method      5
Status              2
Date               31
Month              12
Year                1
Weekday             7
dtype: int64

df.describe()

Findings about the parameters: Fare and Distance Travelled

plt.figure(figsize=(10,8))
plt.hist(df['Fare'],bins=50)

(array([629., 305., 119.,  24.,  13.,   9.,   7.,   8.,   1.,   1.,   2.,
          1.,   3.,   1.,   1.,   1.,   1.,   1.,   1.,   0.,   0.,   2.,
          0.,   2.,   0.,   2.,   0.,   0.,   1.,   1.,   0.,   3.,   1.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   1.]),
 array([   46.5  ,   250.968,   455.436,   659.904,   864.372,  1068.84 ,
         1273.308,  1477.776,  1682.244,  1886.712,  2091.18 ,  2295.648,
         2500.116,  2704.584,  2909.052,  3113.52 ,  3317.988,  3522.456,
         3726.924,  3931.392,  4135.86 ,  4340.328,  4544.796,  4749.264,
         4953.732,  5158.2  ,  5362.668,  5567.136,  5771.604,  5976.072,
         6180.54 ,  6385.008,  6589.476,  6793.944,  6998.412,  7202.88 ,
         7407.348,  7611.816,  7816.284,  8020.752,  8225.22 ,  8429.688,
         8634.156,  8838.624,  9043.092,  9247.56 ,  9452.028,  9656.496,
         9860.964, 10065.432, 10269.9  ]),
 <a list of 50 Patch objects>)

plt.figure(figsize=(10,8))
plt.hist(df['KM'], bins=50)

(array([629., 305., 119.,  24.,  13.,   9.,   7.,   8.,   1.,   1.,   2.,
          1.,   3.,   1.,   1.,   1.,   1.,   1.,   1.,   0.,   0.,   2.,
          0.,   2.,   0.,   2.,   0.,   0.,   1.,   1.,   0.,   3.,   1.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   1.]),
 array([  0.5  ,   6.696,  12.892,  19.088,  25.284,  31.48 ,  37.676,
         43.872,  50.068,  56.264,  62.46 ,  68.656,  74.852,  81.048,
         87.244,  93.44 ,  99.636, 105.832, 112.028, 118.224, 124.42 ,
        130.616, 136.812, 143.008, 149.204, 155.4  , 161.596, 167.792,
        173.988, 180.184, 186.38 , 192.576, 198.772, 204.968, 211.164,
        217.36 , 223.556, 229.752, 235.948, 242.144, 248.34 , 254.536,
        260.732, 266.928, 273.124, 279.32 , 285.516, 291.712, 297.908,
        304.104, 310.3  ]),
 <a list of 50 Patch objects>)

From the above histograms, most of the fare amount falls between 0 - 2000 range and most of the distance travelled is shorter comparatively (0-50 Km).It is also found that the parameters "Kms travelled" and "Fare" are not normally distributed. In order to prove this statistically,we perform the following test:

Shapiro test for normality testing

from scipy.stats import shapiro

norm=df['Fare']
stat, p = shapiro(norm)

print('stat=%.2f,p=%.30f'% (stat,p))

if p>0.05:
  print("Fare data normally distributed")
else:
  print("Fare data not normally distributed")

stat=0.35,p=0.000000000000000000000000000000
Fare data not normally distributed

norm=df['KM']
stat, p = shapiro(norm)

print('stat=%.2f,p=%.30f'% (stat,p))

if p>0.05:
  print("Km data normally distributed")
else:
  print("Km data not normally distributed")

stat=0.35,p=0.000000000000000000000000000000
Km data not normally distributed

So, it is statistically proved that the considered data are not normally distributed by considering a 95% confidence.

Outliers of fare and distance parameters

From the previous section, it is found that the parameters "KM" and "Fare" are not normally distributed. So we can't able to perform z test for outlier testing. We can work with the outliers using the Five- Point Data Desctiption to work with outliers.

At first, we are creating Box plots for the parameters to visualize the outliers.

plt.figure(figsize=(10,6))
sns.boxplot(x=df["Fare"])

<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe8d3a10>

plt.figure(figsize=(10,6))
sns.boxplot(x=df["KM"])

<matplotlib.axes._subplots.AxesSubplot at 0x7f5abee86290>

In this case, removing all the outliers of Fare and Distance parameters is not considerable, since they are the parameters that are eventual to happen. Instead, it is better to calculate a new parameter, Fare per km and removing the outliers based on that would be considerable, since the outliers in this case might be due to unexpected surge.

Price/km distribution

df["Fare_per_km"] = df["Fare"] / df["KM"]
plt.hist(df["Fare_per_km"])

(array([653., 262., 121.,  47.,  28.,   8.,   6.,   8.,   5.,   3.]),
 array([33.09668063, 39.08701257, 45.07734451, 51.06767644, 57.05800838,
        63.04834032, 69.03867225, 75.02900419, 81.01933613, 87.00966806,
        93.        ]),
 <a list of 10 Patch objects>)

sns.boxplot(df["Fare_per_km"])

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe6fb5d0>

df["Fare_per_km"].describe()

count    1141.000000
mean       41.062355
std         8.363648
min        33.096681
25%        35.884615
50%        38.000000
75%        43.344828
max        93.000000
Name: Fare_per_km, dtype: float64

Finding the quantiles of price/km parameter

def outer_boundary(data, variable):
  Q1 = data[variable].quantile(.25)
  Q3 = data[variable].quantile(.75)
  IQR = Q3 - Q1
  low_bound = Q1 - 1.5*IQR
  up_bound = Q3 + 1.5*IQR

  return low_bound, up_bound

low, up = outer_boundary(df,"Fare_per_km")
print("Lower limit = ", low)
print("Upper limit = ", up)

Lower limit =  24.694297082228104
Upper limit =  54.53514588859417

The data falling away from the above boundaries are outliers. Here the values greater than 54.535 are outliers.

Removal of outliers with reference to the lower and upper limit of the 5 point data description

df_outliers = np.where(df["Fare_per_km"] > up,True,np.where(df["Fare_per_km"] < low,True,False))

df.shape

(1141, 25)

df_new = df.loc[~(df_outliers),]

df_new.shape

(1075, 25)

df_new["Fare_per_km"].max()

54.42857142857143

df_new["Fare_per_km"].min()

33.09668063164679

df_new.describe()

sns.boxplot(y=df_new["Fare_per_km"])  #The outliers correponding to Fare per km of the dataset are removed and the new dataset aligns with the new IQR, Q1 and Q3 values. In this case, the below plot showing some outliers are specific to this updated dataset are considerable.

<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe6cfa10>

df_new.describe() - df.describe()

From the above steps, the outliers based on fare per km are removed, due to which the 5 Point data description aligned with the new dataset, which produced some outliers for the fare per km value, but those values are considerable.

Also the difference in the 5 point data description before and after removing the outliers also presented.

df_new.nunique()

Start_date        350
Start_day           7
Start_time        663
SHour              23
End_date          350
End_day             7
End_time          653
Duration           99
Duration_min       99
Category            2
Pickup            160
Destination       167
KM                247
Purpose            10
Driver_Rating       6
Fare              247
Tip                83
Payment_Type        3
Payment_Method      5
Status              2
Date               31
Month              12
Year                1
Weekday             7
Fare_per_km       247
dtype: int64

Max and Min Fare details:

print("Max Fare details:")
print(" ")
print("Max Fare:",df_new.Fare[df_new.Fare==df_new.Fare.max()])
print("KM for Max Fare:",df_new.KM[df_new.Fare==df_new.Fare.max()])
print("Date:",df_new.Start_date[df_new.Fare==df_new.Fare.max()])
print("Date:",df_new.Weekday[df_new.Fare==df_new.Fare.max()])

Max Fare details:
 
Max Fare: 131    10269.9
Name: Fare, dtype: float64
KM for Max Fare: 131    310.3
Name: KM, dtype: float64
Date: 131   2016-02-06
Name: Start_date, dtype: datetime64[ns]
Date: 131    Saturday
Name: Weekday, dtype: object

print("Min Fare details:")
print(" ")
print("Min Fare:",df_new.Fare[df_new.Fare==df_new.Fare.min()])
print("KM for Min Fare:",df_new.KM[df_new.Fare==df_new.Fare.min()])
print("Date:",df_new.Start_date[df_new.Fare==df_new.Fare.min()])
print("Date:",df_new.Weekday[df_new.Fare==df_new.Fare.min()])

Min Fare details:
 
Min Fare: 11      76.2
57      76.2
67      76.2
301     76.2
317     76.2
337     76.2
388     76.2
513     76.2
559     76.2
583     76.2
930     76.2
1039    76.2
1047    76.2
1050    76.2
Name: Fare, dtype: float64
KM for Min Fare: 11      1.4
57      1.4
67      1.4
301     1.4
317     1.4
337     1.4
388     1.4
513     1.4
559     1.4
583     1.4
930     1.4
1039    1.4
1047    1.4
1050    1.4
Name: KM, dtype: float64
Date: 11     2016-01-06
57     2016-01-19
67     2016-01-22
301    2016-03-31
317    2016-04-06
337    2016-04-12
388    2016-04-28
513    2016-06-09
559    2016-06-29
583    2016-07-05
930    2016-10-22
1039   2016-11-29
1047   2016-11-30
1050   2016-11-30
Name: Start_date, dtype: datetime64[ns]
Date: 11      Wednesday
57        Tuesday
67         Friday
301      Thursday
317     Wednesday
337       Tuesday
388      Thursday
513      Thursday
559     Wednesday
583       Tuesday
930      Saturday
1039      Tuesday
1047    Wednesday
1050    Wednesday
Name: Weekday, dtype: object

STUDY OF RELATIONSHIP BETWEEN DIFFERENT PARAMETERS¶

1 a. Correlation between Distance Travelled and Fare¶

In general, it is known that whenever distance increases, Fare also inceases. In order to prove this, we are going to perform the collowing correlation tests:

Null Hypothesis: There's no correlation between Distance Travelled and Fare.

Alternate Hypothesis: There's a correlation between Distance Travelled and Fare.

Test condition: For a 95% confidence, if p value is less than 0.05, then reject null hypothesis, else accept null hypothesis.

sample1 = df_new["KM"]
sample2 = df_new["Fare"]

#Spearman Rank Correlation

from scipy.stats import spearmanr
corr, p = spearmanr(sample1, sample2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")

corr=1.000, p=0.000
Reject Ho and accept Ha

#Pearson Correlation

from scipy.stats import pearsonr
corr, p = pearsonr(sample1, sample2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")

corr=1.000, p=0.000
Reject Ho and accept Ha

sns.lineplot(data=df_new,x="KM",y="Fare")

<matplotlib.axes._subplots.AxesSubplot at 0x7f5abefe6050>

Both the correlation tests resulted a correlation value of 1. This means that "Distance Travelled" and "Fare" parameters are in complete positive linear relationship. Also, from the above test, Null hypothesis is rejected.

So, "Distance Travelled" and "Fare" are correlated.

1 b. Correlation between Distance Travelled and Time duration

In general, it is known that whenever distance increases, Time also inceases. In order to prove this, we are going to perform the collowing correlation tests:

Null Hypothesis: There's no relation between Distance Travelled and Time Duration

Alternate Hypothesis: There's a between Distance Travelled and Time Duration

Test condition: For a 95% confidence, if p value is less than 0.05, then reject null hypothesis, else accept null hypothesis.

sample_1 = df_new["KM"]
sample_2 = df_new["Duration_min"]

#Spearman Rank Correlation

from scipy.stats import spearmanr
corr, p = spearmanr(sample_1, sample_2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")

corr=0.784, p=0.000
Reject Ho and accept Ha

#Pearson Correlation

from scipy.stats import pearsonr
corr, p = pearsonr(sample_1, sample_2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")

corr=0.748, p=0.000
Reject Ho and accept Ha

sns.scatterplot(data=df_new, x="KM", y="Duration_min",color="blue")

<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe5a5910>

The correlation values from both the above tests are greater than 0.05. Also, both tests results in the rejection of null hypothesis.

So, There's a positive relationship between "Distance Travelled" and "Time taken".

Conclusion:

Both the above tests for Distance Travelled vs Time duration as well as Distance Travelled vs Fare resulted in a positive linear relationship.

2. Uber got more revenue from Non - cash payments than from cash payment¶

In order to check whether there's an effect of method of payment an revenue, we perform One-way Anova test.

One way Anova Testing:

Null Hypothesis H0: There's no relation between type of payment and the revenue.

Alternate Hypothesis Ha: There's relation between type of payment and the revenue.

Significance value (alpha) = 0.05

data1 = df_new[["Payment_Type","Fare"]]
data1

#replacing character values to numeric values in Payment_type column
#Cash = 1, Card =2, Digital =3

data1['Payment_Type'].replace(['Cash', 'Card','Digital'],
                        [1,2,3], inplace=True)
data1

/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py:6619: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)

import statsmodels.api as sm
from statsmodels.formula.api import ols

new = ols('Fare ~ Payment_Type', data = data1).fit()

anovaRes = sm.stats.anova_lm(new,typ=2)
anovaRes['PR(>F)']

Payment_Type    0.944015
Residual             NaN
Name: PR(>F), dtype: float64

Here, the f value reopresents the variance of fare among 3 groups, and it shows that the variation is low.

Also, the p-value is very high = 0.944

0.944 > 0.05

Here, wew are accepting the null hypothesis, that the average revenue made by all the modes of payment are same.

Number of rides for modes of payment

x=sns.countplot(x=df_new['Payment_Type'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

sns.set_palette("colorblind")
x=sns.countplot(x=df_new['Payment_Method'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

Revenue based on different types of payment

plt.figure(figsize=(10,7))
sns.set_palette("Paired")
plot=sns.stripplot(y=df_new['Payment_Method'],x=df['Fare'])
plot.set(xlabel='Fare',ylabel= 'Type of Payment',title='Fare vs Payment type')
plt.show()

data = df_new.groupby("Payment_Method")["Fare"].sum()
data
#grouping the data based on the type of fares

Payment_Method
Card            100463.7
Cash             66279.9
Gift Card        47577.3
UPI/e-Wallet    101030.4
Uber Cash       115159.2
Name: Fare, dtype: float64

sns.set_palette("bright")
plt.figure(figsize=(10,8))
data.plot.pie(autopct="%.1f%%")

<matplotlib.axes._subplots.AxesSubplot at 0x7f5ab35df990>

It is found that Uber got more revenue from non-cash payments (84.6%) than cash payments (15.4%). Here Uber cash (Uber wallet payment) is the topmost contributor of payment mode followed by UPI payments.

Conclusion:

From the above analysis it is found that, the sum of revenue made by the digital payments is more than the card and cash payments.

But, proportionally, the number of digital payments made are higher comparatively to the other modes of payments.

This makes the average revenue made by different payment types are similar eventhough digital payments collected more revenue.

3. The Purpose of rides are having an effect on the days of the rides happened¶

In this test, we are about to determine whether there's any effect of weekdays parameter on the purpose of usage . ( Maybe weekends with more usage form Meal/Entertainment and weekdays with official activities).

Chi - Square Hypothesis Testing:

For the comparision of 2 categorical variables, we are going to perform chi-square test.

Null Hypothesis H0: There's no relation between Weekday and Purpose.

Alternate Hypothesis Ha: There's a relation between Weekday and Purpose.

Significance value (alpha): 0.05.

tab = pd.crosstab(df_new.Weekday, df_new.Purpose)   #creating a crosstab of 2 columns
tab

from scipy.stats import chi2_contingency
stat,p,dof,expected = chi2_contingency(tab)
print("stat=%.3f, p=%.3f" % (stat,p))

stat=47.538, p=0.720

if p < 0.05:
    print("Reject H0 and accept Ha")
else:
    print("Accept H0 and reject Ha")

Accept H0 and reject Ha

From the above, the stat value (difference between observed and expected values) is considerably larger and from the test results, it is clear that , there's no relationship between the Weekdays and the Purpose parameters.

Ride Category

sns.set_palette("Paired")
x=sns.countplot(x=df_new['Category'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

It is clearly visible that more number of rides are made for business purposes than for personal.

Purpose of rides

plt.figure(figsize=(15,8))
x = sns.countplot(x=df_new['Purpose'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

It is clearly evident that users use Uber mainly for Meeting and Entertainment activities than for other activities like Supplies, Customer Visit, Airport / Travel etc. Also they use the least for Charity and other commute activities.

4. Other data plots¶

I. Trips and Driver rating

freq_table = pd.crosstab(df_new['Driver_Rating'],'no_of_drivers')
freq_table

plt.figure(figsize=(10,8))
sns.set_palette("rocket")
x=sns.countplot(x=df_new['Driver_Rating'])
plt.ylabel('Number of Drivers')
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='white', size=12)

From the above plot, it is found that the number of trips with different driver rating lies bwtween 169 and 192. Around 180 trips falls under "1 star driver rating" and around 178 trips falls under "5 star driver rating". 171 trips weren't rated. Highest number of drivers falls under "3 Star rating" and lowest number of drivers falls under "2 star rating".

II. Month vs Number of Trips

sns.set_palette("Paired")
plt.figure(figsize=(15,8))
x=sns.countplot(x=df_new['Month'])
x.set(xlabel="Frequency",ylabel="Month")
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

From the plot, it is found that the month January records more number of trips and June records the less number of trips.

III. Day vs Number of Trips

plt.figure(figsize=(10,8))
order = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
x=sns.countplot(x=df_new["Weekday"], order=order)
x.set(xlabel="Frequency",ylabel="Weekday")
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

When aggredated based on weekdays, Sunday records the most number of trips. This might be due to the fact that people spend their weekend in travelling.Followed by Wednesday lies on the 2nd position. On the other hand, Monday records the least number of trips.

IV. Time vs Frequency

sns.set_palette("pastel")
plt.figure(figsize=(15,8))
x=sns.countplot(x=df_new['SHour'])
x.set(xlabel="Hour",ylabel="Frequency")
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

The above plot is showing almost the expected results:

From Midnight to Early morning, records the lowest number of requests.
It gradually increases around the morning peak hour - 9AM to 12 Noon
Likewise, the frequency gradually reduces over the evening peak hour from 17:00 hr towards midnight, it marks the least number of requests.
But unexpectedly, the number of requests peaked at 13:00 - 15:00 hr.

V. Number of trips completed

x=sns.countplot(x=df_new["Status"])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

Among the total trips, 932 were completed and 143 trips were cancelled, which is around 15.34%

VI. Revenue based on purpose of trips

purpose = df_new.groupby("Purpose")["Fare"].sum()
purpose

Purpose
Airport/Travel      17079.3
Between Offices      8366.4
Charity ($)           528.3
Commute              5976.6
Customer Visit      80691.9
Errand/Supplies     30021.9
Meal/Entertain     106197.0
Meeting            139326.3
Moving                994.5
Temporary Site      41328.3
Name: Fare, dtype: float64

plt.figure(figsize=(17,8))
x=sns.barplot(x="Purpose", y="Fare", data=df_new,ci = None, estimator=sum)

Based on purpose, Uber collected the maximum revenue in Meeting(140000), Meal/ Entertainment(105000) and Customer Visit(80000)

VII. PAIRPLOTS

sns.pairplot(df_new)

<seaborn.axisgrid.PairGrid at 0x7f5ab36c2dd0>

VIII. CORRELATION MATRIX

correlation=df_new.corr()

plt.figure(figsize=(15,15))
sns.heatmap(correlation,cmap="coolwarm",annot=True,vmax=.5,vmin=-.5,center=0,square=True,linewidths=.5)

<matplotlib.axes._subplots.AxesSubplot at 0x7f5ab2ef8b10>

From the above correlation heatmap,

Highly Correlated:

Positive correlation:
- Km vs Duration_min
- Duration_min vs Fare
- KM vs Fare
Negative correlation:
- Fare_per_km vs Tip

Moderately correlated

Positive correlation:
- KM vs Tip
- Duration_min vs Tip
- Fare vs Tip
Negative correlation:
- KM vs Fare_per_km
- Fare vs Fare_per_km

!sudo apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic

Reading package lists... Done
Building dependency tree       
Reading state information... Done
texlive-fonts-recommended is already the newest version (2017.20180305-1).
texlive-plain-generic is already the newest version (2017.20180305-2).
texlive-xetex is already the newest version (2017.20180305-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.

!jupyter nbconvert --to html /content/EDA_Uber.ipynb

[NbConvertApp] Converting notebook /content/EDA_Uber.ipynb to html
[NbConvertApp] Writing 1637947 bytes to /content/EDA_Uber.html

	SHour	Duration_min	KM	Driver_Rating	Fare	Tip	Date	Month	Year
count	1141.000000	1141.000000	1141.000000	1141.000000	1141.000000	1141.000000	1141.000000	1141.000000	1141.0
mean	14.656442	23.609991	10.632340	2.549518	380.867222	24.914110	15.858896	6.518843	2016.0
std	4.565827	30.137047	21.697991	1.697380	716.033709	21.243484	8.859244	3.527939	0.0
min	0.000000	0.000000	0.500000	0.000000	46.500000	0.000000	1.000000	1.000000	2016.0
25%	12.000000	10.000000	2.900000	1.000000	125.700000	10.000000	8.000000	3.000000	2016.0
50%	15.000000	16.000000	6.000000	3.000000	228.000000	20.000000	16.000000	7.000000	2016.0
75%	18.000000	28.000000	10.400000	4.000000	373.200000	35.000000	24.000000	10.000000	2016.0
max	23.000000	492.000000	310.300000	5.000000	10269.900000	127.000000	31.000000	12.000000	2016.0

	SHour	Duration_min	KM	Driver_Rating	Fare	Tip	Date	Month	Year	Fare_per_km
count	1075.000000	1075.000000	1075.000000	1075.000000	1075.000000	1075.000000	1075.000000	1075.000000	1075.0	1075.000000
mean	14.667907	24.682791	11.226512	2.533953	400.474884	26.208372	15.881860	6.523721	2016.0	39.487957
std	4.563890	30.720868	22.217617	1.696615	733.181364	21.203442	8.927809	3.539633	0.0	5.058784
min	0.000000	0.000000	1.400000	0.000000	76.200000	0.000000	1.000000	1.000000	2016.0	33.096681
25%	12.000000	11.000000	3.100000	1.000000	132.300000	11.000000	8.000000	3.000000	2016.0	35.752294
50%	15.000000	17.000000	6.200000	3.000000	234.600000	21.000000	16.000000	7.000000	2016.0	37.838710
75%	18.000000	29.000000	10.900000	4.000000	389.700000	36.000000	24.000000	10.000000	2016.0	42.677419
max	23.000000	492.000000	310.300000	5.000000	10269.900000	127.000000	31.000000	12.000000	2016.0	54.428571

	Start_date	Start_day	Start_time	SHour	End_date	End_day	End_time	Duration	Duration_min	Category	Pickup	Destination	KM	Purpose	Driver_Rating	Fare	Tip	Payment_Type	Payment_Method	Status
0	01-Jan-16	Friday	22:54:00	22	01-Jan-16	Friday	01:39:00	02:45:00	165	Business	Jacksonville	Kissimmee	201.0	Meeting	0	6663.0	0	Card	Card	Completed
1	01-Jan-16	Friday	13:42:00	13	01-Jan-16	Friday	13:54:00	00:12:00	12	Personal	Morrisville	Cary	4.4	Errand/Supplies	4	175.2	17	Digital	Uber Cash	Completed
2	01-Jan-16	Friday	22:46:00	22	01-Jan-16	Friday	22:58:00	00:12:00	12	Personal	Morrisville	Cary	3.1	Customer Visit	0	132.3	13	Digital	Gift Card	Completed
3	03-Jan-16	Sunday	12:57:00	12	03-Jan-16	Sunday	13:08:00	00:11:00	11	Business	Whitebridge	Edgehill Farms	4.3	Meal/Entertain	3	171.9	17	Card	Card	Completed
4	03-Jan-16	Sunday	20:24:00	20	03-Jan-16	Sunday	20:40:00	00:16:00	16	Business	Morrisville	Cary	3.1	Customer Visit	3	132.3	13	Digital	Uber Cash	Completed
5	04-Jan-16	Monday	15:34:00	15	04-Jan-16	Monday	15:50:00	00:16:00	16	Business	Cary	Cary	5.9	Meal/Entertain	1	224.7	22	Digital	Gift Card	Completed
6	05-Jan-16	Tuesday	18:39:00	18	05-Jan-16	Tuesday	18:55:00	00:16:00	16	Business	South Congress	The Drag	2.7	Errand/Supplies	2	119.1	11	Digital	UPI/e-Wallet	Completed
7	05-Jan-16	Tuesday	12:29:00	12	05-Jan-16	Tuesday	12:42:00	00:13:00	13	Personal	Kissimmee	Orlando	3.6	Customer Visit	4	148.8	14	Card	Card	Completed
8	05-Jan-16	Tuesday	13:28:00	13	05-Jan-16	Tuesday	13:46:00	00:18:00	18	Business	Lexington Park at Amberly	Westpark Place	1.9	Meeting	5	92.7	9	Card	Card	Completed
9	06-Jan-16	Wednesday	13:42:00	13	06-Jan-16	Wednesday	14:01:00	00:19:00	19	Business	Morrisville	Cary	6.5	Meal/Entertain	4	244.5	24	Card	Card	Completed

	Start_date	Start_day	Start_time	SHour	End_date	End_day	End_time	Duration	Duration_min	Category	...	Driver_Rating	Fare	Tip	Payment_Type	Payment_Method	Status	Date	Month	Year	Weekday
30	2016-01-11	Monday	14:26:00	14	11-Jan-16	Monday	14:31:00	00:05:00	5	Business	...	1	105.9	10	Digital	UPI/e-Wallet	Completed	11	1	2016	Monday
62	2016-01-20	Wednesday	17:30:00	17	20-Jan-16	Wednesday	17:37:00	00:07:00	7	Business	...	3	89.4	8	Cash	Cash	Completed	20	1	2016	Wednesday
121	2016-02-04	Thursday	23:45:00	23	04-Feb-16	Thursday	00:01:00	00:16:00	16	Personal	...	2	119.1	11	Card	Card	Completed	4	2	2016	Thursday
166	2016-02-17	Wednesday	20:53:00	20	17-Feb-16	Wednesday	21:42:00	00:49:00	49	Business	...	3	241.2	24	Cash	Cash	Completed	17	2	2016	Wednesday
415	2016-05-07	Saturday	18:42:00	18	07-May-16	Saturday	18:56:00	00:14:00	14	Business	...	2	142.2	14	Digital	UPI/e-Wallet	Completed	7	5	2016	Saturday
434	2016-05-14	Saturday	17:50:00	17	14-May-16	Saturday	17:59:00	00:09:00	9	Business	...	4	89.4	8	Card	Card	Completed	14	5	2016	Saturday
442	2016-05-16	Monday	20:47:00	20	16-May-16	Monday	20:53:00	00:06:00	6	Business	...	1	69.6	6	Card	Card	Completed	16	5	2016	Monday
500	2016-06-05	Sunday	12:03:00	12	05-Jun-16	Sunday	12:21:00	00:18:00	18	Business	...	0	353.4	35	Digital	Uber Cash	Completed	5	6	2016	Sunday
566	2016-07-01	Friday	17:02:00	17	01-Jul-16	Friday	17:27:00	00:25:00	25	Business	...	5	492.0	49	Card	Card	Completed	1	7	2016	Friday
595	2016-07-11	Monday	22:16:00	22	11-Jul-16	Monday	22:34:00	00:18:00	18	Business	...	1	254.4	0	Digital	UPI/e-Wallet	Completed	11	7	2016	Monday
621	2016-07-18	Monday	00:50:00	0	18-Jul-16	Monday	01:00:00	00:10:00	10	Personal	...	1	89.4	8	Digital	UPI/e-Wallet	Completed	18	7	2016	Monday
624	2016-07-19	Tuesday	17:50:00	17	19-Jul-16	Tuesday	17:56:00	00:06:00	6	Business	...	1	86.1	8	Digital	Uber Cash	Completed	19	7	2016	Tuesday
970	2016-11-07	Monday	12:47:00	12	07-Nov-16	Monday	13:04:00	00:17:00	17	Business	...	1	234.6	23	Card	Card	Completed	7	11	2016	Monday
1102	2016-12-15	Thursday	21:16:00	21	15-Dec-16	Thursday	21:41:00	00:25:00	25	Business	...	2	468.9	46	Digital	UPI/e-Wallet	Completed	15	12	2016	Thursday

	SHour	Duration_min	KM	Driver_Rating	Fare	Tip	Date	Month	Year	Fare_per_km
count	-66.000000	-66.000000	-66.000000	-66.000000	-66.000000	-66.000000	-66.000000	-66.000000	-66.0	-66.000000
mean	0.011465	1.072799	0.594172	-0.015564	19.607662	1.294262	0.022965	0.004878	0.0	-1.574398
std	-0.001937	0.583821	0.519626	-0.000765	17.147654	-0.040041	0.068564	0.011693	0.0	-3.304864
min	0.000000	0.000000	0.900000	0.000000	29.700000	0.000000	0.000000	0.000000	0.0	0.000000
25%	0.000000	1.000000	0.200000	0.000000	6.600000	1.000000	0.000000	0.000000	0.0	-0.132322
50%	0.000000	1.000000	0.200000	0.000000	6.600000	1.000000	0.000000	0.000000	0.0	-0.161290
75%	0.000000	1.000000	0.500000	0.000000	16.500000	1.000000	0.000000	0.000000	0.0	-0.667408
max	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.0	-38.571429

Purpose	Airport/Travel	Between Offices	Charity ($)	Commute	Customer Visit	Errand/Supplies	Meal/Entertain	Meeting	Moving	Temporary Site
Weekday
Friday	5	2	0	0	16	21	45	51	0	15
Monday	8	5	0	0	15	17	49	32	0	8
Saturday	7	6	0	0	20	18	38	45	1	14
Sunday	3	5	0	0	24	27	46	46	0	17
Thursday	3	2	0	0	21	19	39	48	0	16
Tuesday	10	3	0	0	18	23	39	44	2	18
Wednesday	3	2	1	1	19	24	51	43	1	19

col_0	no_of_drivers
Driver_Rating
0	171
1	180
2	169
3	192
4	185
5	178