STUDY ON UBER DATASET

Objectives of the study

  1. To perform a basic EDA on the data.
  2. To study and visualise different parameters using hypothesis testing.
  3. To find the relationships between different parameters.
In [1]:
%matplotlib inline

import datetime as dt
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
In [2]:
var = pd.read_csv('/content/Uber project dataset.csv')
In [3]:
var.head(10)
Out[3]:
Start_date Start_day Start_time SHour End_date End_day End_time Duration Duration_min Category Pickup Destination KM Purpose Driver_Rating Fare Tip Payment_Type Payment_Method Status
0 01-Jan-16 Friday 22:54:00 22 01-Jan-16 Friday 01:39:00 02:45:00 165 Business Jacksonville Kissimmee 201.0 Meeting 0 6663.0 0 Card Card Completed
1 01-Jan-16 Friday 13:42:00 13 01-Jan-16 Friday 13:54:00 00:12:00 12 Personal Morrisville Cary 4.4 Errand/Supplies 4 175.2 17 Digital Uber Cash Completed
2 01-Jan-16 Friday 22:46:00 22 01-Jan-16 Friday 22:58:00 00:12:00 12 Personal Morrisville Cary 3.1 Customer Visit 0 132.3 13 Digital Gift Card Completed
3 03-Jan-16 Sunday 12:57:00 12 03-Jan-16 Sunday 13:08:00 00:11:00 11 Business Whitebridge Edgehill Farms 4.3 Meal/Entertain 3 171.9 17 Card Card Completed
4 03-Jan-16 Sunday 20:24:00 20 03-Jan-16 Sunday 20:40:00 00:16:00 16 Business Morrisville Cary 3.1 Customer Visit 3 132.3 13 Digital Uber Cash Completed
5 04-Jan-16 Monday 15:34:00 15 04-Jan-16 Monday 15:50:00 00:16:00 16 Business Cary Cary 5.9 Meal/Entertain 1 224.7 22 Digital Gift Card Completed
6 05-Jan-16 Tuesday 18:39:00 18 05-Jan-16 Tuesday 18:55:00 00:16:00 16 Business South Congress The Drag 2.7 Errand/Supplies 2 119.1 11 Digital UPI/e-Wallet Completed
7 05-Jan-16 Tuesday 12:29:00 12 05-Jan-16 Tuesday 12:42:00 00:13:00 13 Personal Kissimmee Orlando 3.6 Customer Visit 4 148.8 14 Card Card Completed
8 05-Jan-16 Tuesday 13:28:00 13 05-Jan-16 Tuesday 13:46:00 00:18:00 18 Business Lexington Park at Amberly Westpark Place 1.9 Meeting 5 92.7 9 Card Card Completed
9 06-Jan-16 Wednesday 13:42:00 13 06-Jan-16 Wednesday 14:01:00 00:19:00 19 Business Morrisville Cary 6.5 Meal/Entertain 4 244.5 24 Card Card Completed
In [4]:
var.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1155 entries, 0 to 1154
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Start_date      1155 non-null   object 
 1   Start_day       1155 non-null   object 
 2   Start_time      1155 non-null   object 
 3   SHour           1155 non-null   int64  
 4   End_date        1155 non-null   object 
 5   End_day         1155 non-null   object 
 6   End_time        1155 non-null   object 
 7   Duration        1155 non-null   object 
 8   Duration_min    1155 non-null   int64  
 9   Category        1155 non-null   object 
 10  Pickup          1155 non-null   object 
 11  Destination     1155 non-null   object 
 12  KM              1155 non-null   float64
 13  Purpose         1141 non-null   object 
 14  Driver_Rating   1155 non-null   int64  
 15  Fare            1155 non-null   float64
 16  Tip             1155 non-null   int64  
 17  Payment_Type    1155 non-null   object 
 18  Payment_Method  1155 non-null   object 
 19  Status          1155 non-null   object 
dtypes: float64(2), int64(4), object(14)
memory usage: 180.6+ KB
In [5]:
print("Prints all the columns that are of numerical type")
var.select_dtypes(include=['number'])
Prints all the columns that are of numerical type
Out[5]:
SHour Duration_min KM Driver_Rating Fare Tip
0 22 165 201.0 0 6663.0 0
1 13 12 4.4 4 175.2 17
2 22 12 3.1 0 132.3 13
3 12 11 4.3 3 171.9 17
4 20 16 3.1 3 132.3 13
... ... ... ... ... ... ...
1150 20 9 3.0 4 129.0 12
1151 16 12 5.7 1 218.1 21
1152 13 19 10.4 5 373.2 37
1153 19 9 2.2 0 102.6 10
1154 19 9 2.9 5 125.7 12

1155 rows × 6 columns

In [6]:
print("Prints all the columns that are of non-numerical type")
var.select_dtypes(include=['object'])
Prints all the columns that are of non-numerical type
Out[6]:
Start_date Start_day Start_time End_date End_day End_time Duration Category Pickup Destination Purpose Payment_Type Payment_Method Status
0 01-Jan-16 Friday 22:54:00 01-Jan-16 Friday 01:39:00 02:45:00 Business Jacksonville Kissimmee Meeting Card Card Completed
1 01-Jan-16 Friday 13:42:00 01-Jan-16 Friday 13:54:00 00:12:00 Personal Morrisville Cary Errand/Supplies Digital Uber Cash Completed
2 01-Jan-16 Friday 22:46:00 01-Jan-16 Friday 22:58:00 00:12:00 Personal Morrisville Cary Customer Visit Digital Gift Card Completed
3 03-Jan-16 Sunday 12:57:00 03-Jan-16 Sunday 13:08:00 00:11:00 Business Whitebridge Edgehill Farms Meal/Entertain Card Card Completed
4 03-Jan-16 Sunday 20:24:00 03-Jan-16 Sunday 20:40:00 00:16:00 Business Morrisville Cary Customer Visit Digital Uber Cash Completed
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1150 30-Dec-16 Friday 20:48:00 30-Dec-16 Friday 20:57:00 00:09:00 Business Morrisville Cary Customer Visit Digital Uber Cash Completed
1151 31-Dec-16 Saturday 16:21:00 31-Dec-16 Saturday 16:33:00 00:12:00 Business Cary Apex Errand/Supplies Digital Uber Cash Completed
1152 31-Dec-16 Saturday 13:22:00 31-Dec-16 Saturday 13:41:00 00:19:00 Business Durham Cary Meeting Digital UPI/e-Wallet Completed
1153 31-Dec-16 Saturday 19:41:00 31-Dec-16 Saturday 19:50:00 00:09:00 Business Kenner Kenner Meeting Digital Uber Cash Completed
1154 31-Dec-16 Saturday 19:30:00 31-Dec-16 Saturday 19:39:00 00:09:00 Business Morrisville Cary Errand/Supplies Card Card Completed

1155 rows × 14 columns

Extracting date parameters - Date, Month, Year and Weekday - from the Start_date column

In [7]:
var['Start_date']=pd.to_datetime(var['Start_date'])
In [8]:
var['Date'] = var['Start_date'].dt.day
var['Month'] = var['Start_date'].dt.month
var['Year'] = var['Start_date'].dt.year
var['Weekday']=var['Start_date'].dt.day_name()
In [9]:
print(var)
     Start_date Start_day Start_time  SHour   End_date   End_day  End_time  \
0    2016-01-01    Friday   22:54:00     22  01-Jan-16    Friday  01:39:00   
1    2016-01-01    Friday   13:42:00     13  01-Jan-16    Friday  13:54:00   
2    2016-01-01    Friday   22:46:00     22  01-Jan-16    Friday  22:58:00   
3    2016-01-03    Sunday   12:57:00     12  03-Jan-16    Sunday  13:08:00   
4    2016-01-03    Sunday   20:24:00     20  03-Jan-16    Sunday  20:40:00   
...         ...       ...        ...    ...        ...       ...       ...   
1150 2016-12-30    Friday   20:48:00     20  30-Dec-16    Friday  20:57:00   
1151 2016-12-31  Saturday   16:21:00     16  31-Dec-16  Saturday  16:33:00   
1152 2016-12-31  Saturday   13:22:00     13  31-Dec-16  Saturday  13:41:00   
1153 2016-12-31  Saturday   19:41:00     19  31-Dec-16  Saturday  19:50:00   
1154 2016-12-31  Saturday   19:30:00     19  31-Dec-16  Saturday  19:39:00   

      Duration  Duration_min  Category  ... Driver_Rating    Fare  Tip  \
0     02:45:00           165  Business  ...             0  6663.0    0   
1     00:12:00            12  Personal  ...             4   175.2   17   
2     00:12:00            12  Personal  ...             0   132.3   13   
3     00:11:00            11  Business  ...             3   171.9   17   
4     00:16:00            16  Business  ...             3   132.3   13   
...        ...           ...       ...  ...           ...     ...  ...   
1150  00:09:00             9  Business  ...             4   129.0   12   
1151  00:12:00            12  Business  ...             1   218.1   21   
1152  00:19:00            19  Business  ...             5   373.2   37   
1153  00:09:00             9  Business  ...             0   102.6   10   
1154  00:09:00             9  Business  ...             5   125.7   12   

     Payment_Type  Payment_Method     Status  Date Month  Year   Weekday  
0            Card            Card  Completed     1     1  2016    Friday  
1         Digital       Uber Cash  Completed     1     1  2016    Friday  
2         Digital       Gift Card  Completed     1     1  2016    Friday  
3            Card            Card  Completed     3     1  2016    Sunday  
4         Digital       Uber Cash  Completed     3     1  2016    Sunday  
...           ...             ...        ...   ...   ...   ...       ...  
1150      Digital       Uber Cash  Completed    30    12  2016    Friday  
1151      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1152      Digital    UPI/e-Wallet  Completed    31    12  2016  Saturday  
1153      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1154         Card            Card  Completed    31    12  2016  Saturday  

[1155 rows x 24 columns]
In [10]:
[var.isna().any(axis=1)]
Out[10]:
[0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1150    False
 1151    False
 1152    False
 1153    False
 1154    False
 Length: 1155, dtype: bool]
In [11]:
var[var.isna().any(axis=1)]
Out[11]:
Start_date Start_day Start_time SHour End_date End_day End_time Duration Duration_min Category ... Driver_Rating Fare Tip Payment_Type Payment_Method Status Date Month Year Weekday
30 2016-01-11 Monday 14:26:00 14 11-Jan-16 Monday 14:31:00 00:05:00 5 Business ... 1 105.9 10 Digital UPI/e-Wallet Completed 11 1 2016 Monday
62 2016-01-20 Wednesday 17:30:00 17 20-Jan-16 Wednesday 17:37:00 00:07:00 7 Business ... 3 89.4 8 Cash Cash Completed 20 1 2016 Wednesday
121 2016-02-04 Thursday 23:45:00 23 04-Feb-16 Thursday 00:01:00 00:16:00 16 Personal ... 2 119.1 11 Card Card Completed 4 2 2016 Thursday
166 2016-02-17 Wednesday 20:53:00 20 17-Feb-16 Wednesday 21:42:00 00:49:00 49 Business ... 3 241.2 24 Cash Cash Completed 17 2 2016 Wednesday
415 2016-05-07 Saturday 18:42:00 18 07-May-16 Saturday 18:56:00 00:14:00 14 Business ... 2 142.2 14 Digital UPI/e-Wallet Completed 7 5 2016 Saturday
434 2016-05-14 Saturday 17:50:00 17 14-May-16 Saturday 17:59:00 00:09:00 9 Business ... 4 89.4 8 Card Card Completed 14 5 2016 Saturday
442 2016-05-16 Monday 20:47:00 20 16-May-16 Monday 20:53:00 00:06:00 6 Business ... 1 69.6 6 Card Card Completed 16 5 2016 Monday
500 2016-06-05 Sunday 12:03:00 12 05-Jun-16 Sunday 12:21:00 00:18:00 18 Business ... 0 353.4 35 Digital Uber Cash Completed 5 6 2016 Sunday
566 2016-07-01 Friday 17:02:00 17 01-Jul-16 Friday 17:27:00 00:25:00 25 Business ... 5 492.0 49 Card Card Completed 1 7 2016 Friday
595 2016-07-11 Monday 22:16:00 22 11-Jul-16 Monday 22:34:00 00:18:00 18 Business ... 1 254.4 0 Digital UPI/e-Wallet Completed 11 7 2016 Monday
621 2016-07-18 Monday 00:50:00 0 18-Jul-16 Monday 01:00:00 00:10:00 10 Personal ... 1 89.4 8 Digital UPI/e-Wallet Completed 18 7 2016 Monday
624 2016-07-19 Tuesday 17:50:00 17 19-Jul-16 Tuesday 17:56:00 00:06:00 6 Business ... 1 86.1 8 Digital Uber Cash Completed 19 7 2016 Tuesday
970 2016-11-07 Monday 12:47:00 12 07-Nov-16 Monday 13:04:00 00:17:00 17 Business ... 1 234.6 23 Card Card Completed 7 11 2016 Monday
1102 2016-12-15 Thursday 21:16:00 21 15-Dec-16 Thursday 21:41:00 00:25:00 25 Business ... 2 468.9 46 Digital UPI/e-Wallet Completed 15 12 2016 Thursday

14 rows × 24 columns

In [12]:
var.dropna(axis=0, how="any", inplace=True)
var.shape
Out[12]:
(1141, 24)

Creating a pandas dataframe

In [13]:
df=pd.DataFrame(var)
print(df)
     Start_date Start_day Start_time  SHour   End_date   End_day  End_time  \
0    2016-01-01    Friday   22:54:00     22  01-Jan-16    Friday  01:39:00   
1    2016-01-01    Friday   13:42:00     13  01-Jan-16    Friday  13:54:00   
2    2016-01-01    Friday   22:46:00     22  01-Jan-16    Friday  22:58:00   
3    2016-01-03    Sunday   12:57:00     12  03-Jan-16    Sunday  13:08:00   
4    2016-01-03    Sunday   20:24:00     20  03-Jan-16    Sunday  20:40:00   
...         ...       ...        ...    ...        ...       ...       ...   
1150 2016-12-30    Friday   20:48:00     20  30-Dec-16    Friday  20:57:00   
1151 2016-12-31  Saturday   16:21:00     16  31-Dec-16  Saturday  16:33:00   
1152 2016-12-31  Saturday   13:22:00     13  31-Dec-16  Saturday  13:41:00   
1153 2016-12-31  Saturday   19:41:00     19  31-Dec-16  Saturday  19:50:00   
1154 2016-12-31  Saturday   19:30:00     19  31-Dec-16  Saturday  19:39:00   

      Duration  Duration_min  Category  ... Driver_Rating    Fare  Tip  \
0     02:45:00           165  Business  ...             0  6663.0    0   
1     00:12:00            12  Personal  ...             4   175.2   17   
2     00:12:00            12  Personal  ...             0   132.3   13   
3     00:11:00            11  Business  ...             3   171.9   17   
4     00:16:00            16  Business  ...             3   132.3   13   
...        ...           ...       ...  ...           ...     ...  ...   
1150  00:09:00             9  Business  ...             4   129.0   12   
1151  00:12:00            12  Business  ...             1   218.1   21   
1152  00:19:00            19  Business  ...             5   373.2   37   
1153  00:09:00             9  Business  ...             0   102.6   10   
1154  00:09:00             9  Business  ...             5   125.7   12   

     Payment_Type  Payment_Method     Status  Date Month  Year   Weekday  
0            Card            Card  Completed     1     1  2016    Friday  
1         Digital       Uber Cash  Completed     1     1  2016    Friday  
2         Digital       Gift Card  Completed     1     1  2016    Friday  
3            Card            Card  Completed     3     1  2016    Sunday  
4         Digital       Uber Cash  Completed     3     1  2016    Sunday  
...           ...             ...        ...   ...   ...   ...       ...  
1150      Digital       Uber Cash  Completed    30    12  2016    Friday  
1151      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1152      Digital    UPI/e-Wallet  Completed    31    12  2016  Saturday  
1153      Digital       Uber Cash  Completed    31    12  2016  Saturday  
1154         Card            Card  Completed    31    12  2016  Saturday  

[1141 rows x 24 columns]
In [14]:
df.nunique()
Out[14]:
Start_date        352
Start_day           7
Start_time        685
SHour              23
End_date          352
End_day             7
End_time          675
Duration           99
Duration_min       99
Category            2
Pickup            175
Destination       183
KM                256
Purpose            10
Driver_Rating       6
Fare              256
Tip                86
Payment_Type        3
Payment_Method      5
Status              2
Date               31
Month              12
Year                1
Weekday             7
dtype: int64
In [15]:
df.describe()
Out[15]:
SHour Duration_min KM Driver_Rating Fare Tip Date Month Year
count 1141.000000 1141.000000 1141.000000 1141.000000 1141.000000 1141.000000 1141.000000 1141.000000 1141.0
mean 14.656442 23.609991 10.632340 2.549518 380.867222 24.914110 15.858896 6.518843 2016.0
std 4.565827 30.137047 21.697991 1.697380 716.033709 21.243484 8.859244 3.527939 0.0
min 0.000000 0.000000 0.500000 0.000000 46.500000 0.000000 1.000000 1.000000 2016.0
25% 12.000000 10.000000 2.900000 1.000000 125.700000 10.000000 8.000000 3.000000 2016.0
50% 15.000000 16.000000 6.000000 3.000000 228.000000 20.000000 16.000000 7.000000 2016.0
75% 18.000000 28.000000 10.400000 4.000000 373.200000 35.000000 24.000000 10.000000 2016.0
max 23.000000 492.000000 310.300000 5.000000 10269.900000 127.000000 31.000000 12.000000 2016.0

Findings about the parameters: Fare and Distance Travelled

In [16]:
plt.figure(figsize=(10,8))
plt.hist(df['Fare'],bins=50)
Out[16]:
(array([629., 305., 119.,  24.,  13.,   9.,   7.,   8.,   1.,   1.,   2.,
          1.,   3.,   1.,   1.,   1.,   1.,   1.,   1.,   0.,   0.,   2.,
          0.,   2.,   0.,   2.,   0.,   0.,   1.,   1.,   0.,   3.,   1.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   1.]),
 array([   46.5  ,   250.968,   455.436,   659.904,   864.372,  1068.84 ,
         1273.308,  1477.776,  1682.244,  1886.712,  2091.18 ,  2295.648,
         2500.116,  2704.584,  2909.052,  3113.52 ,  3317.988,  3522.456,
         3726.924,  3931.392,  4135.86 ,  4340.328,  4544.796,  4749.264,
         4953.732,  5158.2  ,  5362.668,  5567.136,  5771.604,  5976.072,
         6180.54 ,  6385.008,  6589.476,  6793.944,  6998.412,  7202.88 ,
         7407.348,  7611.816,  7816.284,  8020.752,  8225.22 ,  8429.688,
         8634.156,  8838.624,  9043.092,  9247.56 ,  9452.028,  9656.496,
         9860.964, 10065.432, 10269.9  ]),
 <a list of 50 Patch objects>)
In [17]:
plt.figure(figsize=(10,8))
plt.hist(df['KM'], bins=50)
Out[17]:
(array([629., 305., 119.,  24.,  13.,   9.,   7.,   8.,   1.,   1.,   2.,
          1.,   3.,   1.,   1.,   1.,   1.,   1.,   1.,   0.,   0.,   2.,
          0.,   2.,   0.,   2.,   0.,   0.,   1.,   1.,   0.,   3.,   1.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   1.]),
 array([  0.5  ,   6.696,  12.892,  19.088,  25.284,  31.48 ,  37.676,
         43.872,  50.068,  56.264,  62.46 ,  68.656,  74.852,  81.048,
         87.244,  93.44 ,  99.636, 105.832, 112.028, 118.224, 124.42 ,
        130.616, 136.812, 143.008, 149.204, 155.4  , 161.596, 167.792,
        173.988, 180.184, 186.38 , 192.576, 198.772, 204.968, 211.164,
        217.36 , 223.556, 229.752, 235.948, 242.144, 248.34 , 254.536,
        260.732, 266.928, 273.124, 279.32 , 285.516, 291.712, 297.908,
        304.104, 310.3  ]),
 <a list of 50 Patch objects>)

From the above histograms, most fares fall in the 0 - 2000 range and most trips cover comparatively short distances (0 - 50 km). The histograms also suggest that the parameters "KM" (distance travelled) and "Fare" are not normally distributed. To confirm this statistically, we perform the following test:

Shapiro-Wilk test for normality

In [18]:
from scipy.stats import shapiro

norm=df['Fare']
stat, p = shapiro(norm)

print('stat=%.2f,p=%.30f'% (stat,p))

if p>0.05:
  print("Fare data normally distributed")
else:
  print("Fare data not normally distributed")
stat=0.35,p=0.000000000000000000000000000000
Fare data not normally distributed
In [19]:
norm=df['KM']
stat, p = shapiro(norm)

print('stat=%.2f,p=%.30f'% (stat,p))

if p>0.05:
  print("Km data normally distributed")
else:
  print("Km data not normally distributed")
stat=0.35,p=0.000000000000000000000000000000
Km data not normally distributed

So, at the 95% confidence level, it is statistically confirmed that neither Fare nor KM is normally distributed.
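
As a visual complement to the Shapiro-Wilk results, the two parameters can also be inspected with Q-Q plots against a normal distribution; heavily skewed data such as these bends sharply away from the reference line. A minimal sketch (not part of the original analysis), assuming df is loaded as above:

import scipy.stats as stats
import matplotlib.pyplot as plt

# Q-Q plots of Fare and KM against a theoretical normal distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
stats.probplot(df['Fare'], dist="norm", plot=axes[0])
axes[0].set_title("Q-Q plot: Fare")
stats.probplot(df['KM'], dist="norm", plot=axes[1])
axes[1].set_title("Q-Q plot: KM")
plt.show()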

Outliers of fare and distance parameters

From the previous section, it is found that the parameters "KM" and "Fare" are not normally distributed, so a z-test cannot be used for outlier detection. Instead, we handle the outliers using the five-point data description (the IQR method).

At first, we create box plots of both parameters to visualize the outliers.

In [20]:
plt.figure(figsize=(10,6))
sns.boxplot(x=df["Fare"])
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe8d3a10>
In [21]:
plt.figure(figsize=(10,6))
sns.boxplot(x=df["KM"])
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5abee86290>

In this case, removing all the outliers of the Fare and distance parameters outright is not advisable, since long, expensive trips do legitimately occur. Instead, it is better to compute a new parameter, fare per km, and remove outliers based on it, since outliers in that ratio are more likely due to unexpected surge pricing.

Price/km distribution

In [22]:
df["Fare_per_km"] = df["Fare"] / df["KM"]
plt.hist(df["Fare_per_km"])
Out[22]:
(array([653., 262., 121.,  47.,  28.,   8.,   6.,   8.,   5.,   3.]),
 array([33.09668063, 39.08701257, 45.07734451, 51.06767644, 57.05800838,
        63.04834032, 69.03867225, 75.02900419, 81.01933613, 87.00966806,
        93.        ]),
 <a list of 10 Patch objects>)
In [23]:
sns.boxplot(x=df["Fare_per_km"])   # passing as a keyword argument avoids seaborn's FutureWarning
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe6fb5d0>
In [24]:
df["Fare_per_km"].describe()
Out[24]:
count    1141.000000
mean       41.062355
std         8.363648
min        33.096681
25%        35.884615
50%        38.000000
75%        43.344828
max        93.000000
Name: Fare_per_km, dtype: float64

Finding the IQR boundaries of the fare-per-km parameter

In [25]:
def outer_boundary(data, variable):
  Q1 = data[variable].quantile(.25)
  Q3 = data[variable].quantile(.75)
  IQR = Q3 - Q1
  low_bound = Q1 - 1.5*IQR
  up_bound = Q3 + 1.5*IQR

  return low_bound, up_bound
In [26]:
low, up = outer_boundary(df,"Fare_per_km")
print("Lower limit = ", low)
print("Upper limit = ", up)
Lower limit =  24.694297082228104
Upper limit =  54.53514588859417

Data falling outside these boundaries are outliers. Since the minimum fare per km (33.10) already lies above the lower limit, only values greater than 54.54 are treated as outliers here.

Removing outliers with reference to the lower and upper limits of the five-point data description

In [27]:
df_outliers = np.where(df["Fare_per_km"] > up,True,np.where(df["Fare_per_km"] < low,True,False))
In [28]:
df.shape
Out[28]:
(1141, 25)
In [29]:
df_new = df.loc[~(df_outliers),]
In [30]:
df_new.shape
Out[30]:
(1075, 25)
In [31]:
df_new["Fare_per_km"].max()
Out[31]:
54.42857142857143
In [32]:
df_new["Fare_per_km"].min()
Out[32]:
33.09668063164679
In [33]:
df_new.describe()
Out[33]:
SHour Duration_min KM Driver_Rating Fare Tip Date Month Year Fare_per_km
count 1075.000000 1075.000000 1075.000000 1075.000000 1075.000000 1075.000000 1075.000000 1075.000000 1075.0 1075.000000
mean 14.667907 24.682791 11.226512 2.533953 400.474884 26.208372 15.881860 6.523721 2016.0 39.487957
std 4.563890 30.720868 22.217617 1.696615 733.181364 21.203442 8.927809 3.539633 0.0 5.058784
min 0.000000 0.000000 1.400000 0.000000 76.200000 0.000000 1.000000 1.000000 2016.0 33.096681
25% 12.000000 11.000000 3.100000 1.000000 132.300000 11.000000 8.000000 3.000000 2016.0 35.752294
50% 15.000000 17.000000 6.200000 3.000000 234.600000 21.000000 16.000000 7.000000 2016.0 37.838710
75% 18.000000 29.000000 10.900000 4.000000 389.700000 36.000000 24.000000 10.000000 2016.0 42.677419
max 23.000000 492.000000 310.300000 5.000000 10269.900000 127.000000 31.000000 12.000000 2016.0 54.428571
In [34]:
# The fare-per-km outliers have been removed, so the reduced dataset has its own Q1, Q3 and IQR.
# Any points flagged as outliers in the plot below are relative to this updated dataset and are acceptable.
sns.boxplot(y=df_new["Fare_per_km"])
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe6cfa10>
In [35]:
df_new.describe() - df.describe()
Out[35]:
SHour Duration_min KM Driver_Rating Fare Tip Date Month Year Fare_per_km
count -66.000000 -66.000000 -66.000000 -66.000000 -66.000000 -66.000000 -66.000000 -66.000000 -66.0 -66.000000
mean 0.011465 1.072799 0.594172 -0.015564 19.607662 1.294262 0.022965 0.004878 0.0 -1.574398
std -0.001937 0.583821 0.519626 -0.000765 17.147654 -0.040041 0.068564 0.011693 0.0 -3.304864
min 0.000000 0.000000 0.900000 0.000000 29.700000 0.000000 0.000000 0.000000 0.0 0.000000
25% 0.000000 1.000000 0.200000 0.000000 6.600000 1.000000 0.000000 0.000000 0.0 -0.132322
50% 0.000000 1.000000 0.200000 0.000000 6.600000 1.000000 0.000000 0.000000 0.0 -0.161290
75% 0.000000 1.000000 0.500000 0.000000 16.500000 1.000000 0.000000 0.000000 0.0 -0.667408
max 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 -38.571429

From the above steps, the outliers based on fare per km were removed. The five-point data description was then recomputed on the reduced dataset, which flags a few new fare-per-km values as outliers relative to the updated IQR; these are acceptable.

The difference in the five-point data description before and after removing the outliers is also presented above.

In [36]:
df_new.nunique()
Out[36]:
Start_date        350
Start_day           7
Start_time        663
SHour              23
End_date          350
End_day             7
End_time          653
Duration           99
Duration_min       99
Category            2
Pickup            160
Destination       167
KM                247
Purpose            10
Driver_Rating       6
Fare              247
Tip                83
Payment_Type        3
Payment_Method      5
Status              2
Date               31
Month              12
Year                1
Weekday             7
Fare_per_km       247
dtype: int64

Max and Min Fare details:

In [37]:
print("Max Fare details:")
print(" ")
print("Max Fare:",df_new.Fare[df_new.Fare==df_new.Fare.max()])
print("KM for Max Fare:",df_new.KM[df_new.Fare==df_new.Fare.max()])
print("Date:",df_new.Start_date[df_new.Fare==df_new.Fare.max()])
print("Date:",df_new.Weekday[df_new.Fare==df_new.Fare.max()])
Max Fare details:
 
Max Fare: 131    10269.9
Name: Fare, dtype: float64
KM for Max Fare: 131    310.3
Name: KM, dtype: float64
Date: 131   2016-02-06
Name: Start_date, dtype: datetime64[ns]
Date: 131    Saturday
Name: Weekday, dtype: object
In [38]:
print("Min Fare details:")
print(" ")
print("Min Fare:",df_new.Fare[df_new.Fare==df_new.Fare.min()])
print("KM for Min Fare:",df_new.KM[df_new.Fare==df_new.Fare.min()])
print("Date:",df_new.Start_date[df_new.Fare==df_new.Fare.min()])
print("Date:",df_new.Weekday[df_new.Fare==df_new.Fare.min()])
Min Fare details:
 
Min Fare: 11      76.2
57      76.2
67      76.2
301     76.2
317     76.2
337     76.2
388     76.2
513     76.2
559     76.2
583     76.2
930     76.2
1039    76.2
1047    76.2
1050    76.2
Name: Fare, dtype: float64
KM for Min Fare: 11      1.4
57      1.4
67      1.4
301     1.4
317     1.4
337     1.4
388     1.4
513     1.4
559     1.4
583     1.4
930     1.4
1039    1.4
1047    1.4
1050    1.4
Name: KM, dtype: float64
Date: 11     2016-01-06
57     2016-01-19
67     2016-01-22
301    2016-03-31
317    2016-04-06
337    2016-04-12
388    2016-04-28
513    2016-06-09
559    2016-06-29
583    2016-07-05
930    2016-10-22
1039   2016-11-29
1047   2016-11-30
1050   2016-11-30
Name: Start_date, dtype: datetime64[ns]
Date: 11      Wednesday
57        Tuesday
67         Friday
301      Thursday
317     Wednesday
337       Tuesday
388      Thursday
513      Thursday
559     Wednesday
583       Tuesday
930      Saturday
1039      Tuesday
1047    Wednesday
1050    Wednesday
Name: Weekday, dtype: object

STUDY OF RELATIONSHIP BETWEEN DIFFERENT PARAMETERS

1 a. Correlation between Distance Travelled and Fare

In general, it is known that as distance increases, fare also increases. In order to prove this, we are going to perform the following correlation tests:

Null Hypothesis: There's no correlation between Distance Travelled and Fare.

Alternate Hypothesis: There's a correlation between Distance Travelled and Fare.

Test condition: At 95% confidence, if the p-value is less than 0.05, reject the null hypothesis; otherwise, fail to reject (accept) it.

In [39]:
sample1 = df_new["KM"]
sample2 = df_new["Fare"]
In [40]:
#Spearman Rank Correlation

from scipy.stats import spearmanr
corr, p = spearmanr(sample1, sample2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")
corr=1.000, p=0.000
Reject Ho and accept Ha
In [41]:
#Pearson Correlation

from scipy.stats import pearsonr
corr, p = pearsonr(sample1, sample2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")
corr=1.000, p=0.000
Reject Ho and accept Ha
In [42]:
sns.lineplot(data=df_new,x="KM",y="Fare")
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5abefe6050>

Both correlation tests give a coefficient of 1.000 (to three decimal places), which means that the "Distance Travelled" and "Fare" parameters are in an essentially perfect positive linear relationship. The null hypothesis is also rejected in both tests.

So, "Distance Travelled" and "Fare" are correlated.

1 b. Correlation between Distance Travelled and Time duration

In general, it is known that as distance increases, travel time also increases. In order to prove this, we are going to perform the following correlation tests:

Null Hypothesis: There's no relation between Distance Travelled and Time Duration.

Alternate Hypothesis: There's a relation between Distance Travelled and Time Duration.

Test condition: At 95% confidence, if the p-value is less than 0.05, reject the null hypothesis; otherwise, fail to reject (accept) it.

In [43]:
sample_1 = df_new["KM"]
sample_2 = df_new["Duration_min"]
In [44]:
#Spearman Rank Correlation

from scipy.stats import spearmanr
corr, p = spearmanr(sample_1, sample_2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")
corr=0.784, p=0.000
Reject Ho and accept Ha
In [45]:
#Pearson Correlation

from scipy.stats import pearsonr
corr, p = pearsonr(sample_1, sample_2)
                    
print('corr=%.3f, p=%.3f' % (corr,p))

if p < 0.05:
  print("Reject Ho and accept Ha")
else:
  print("Reject Ha and accept H0")
corr=0.748, p=0.000
Reject Ho and accept Ha
In [46]:
sns.scatterplot(data=df_new, x="KM", y="Duration_min",color="blue")
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5abe5a5910>

Both tests give large positive correlation coefficients (0.784 and 0.748) with p-values below 0.05, so the null hypothesis is rejected in both cases.

So, there's a positive relationship between "Distance Travelled" and "Time Duration".
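
Because distance and duration move together, an implied average speed per trip can also be derived from the two columns. A minimal sketch, assuming df_new as above (trips with a zero-minute duration are excluded to avoid division by zero):

# Implied average speed (km/h) per trip
moving = df_new[df_new["Duration_min"] > 0]
speed_kmh = moving["KM"] / (moving["Duration_min"] / 60)
print(speed_kmh.describe().round(1))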


Conclusion:

Both of the above analyses, Distance Travelled vs Time Duration and Distance Travelled vs Fare, show a positive linear relationship.



2. Uber got more revenue from non-cash payments than from cash payments

In order to check whether the method of payment has an effect on revenue, we perform a one-way ANOVA test.

One-way ANOVA Testing:

Null Hypothesis H0: The mean fare (revenue per ride) is the same across all payment types.

Alternate Hypothesis Ha: At least one payment type has a different mean fare.

Significance value (alpha) = 0.05

In [47]:
data1 = df_new[["Payment_Type","Fare"]].copy()   # .copy() avoids the SettingWithCopyWarning triggered by the recoding below
data1
Out[47]:
Payment_Type Fare
0 Card 6663.0
1 Digital 175.2
2 Digital 132.3
3 Card 171.9
4 Digital 132.3
... ... ...
1150 Digital 129.0
1151 Digital 218.1
1152 Digital 373.2
1153 Digital 102.6
1154 Card 125.7

1075 rows × 2 columns

In [48]:
#replacing character values to numeric values in Payment_type column
#Cash = 1, Card =2, Digital =3

data1['Payment_Type'].replace(['Cash', 'Card','Digital'],
                        [1,2,3], inplace=True)
data1
Out[48]:
Payment_Type Fare
0 2 6663.0
1 3 175.2
2 3 132.3
3 2 171.9
4 3 132.3
... ... ...
1150 3 129.0
1151 3 218.1
1152 3 373.2
1153 3 102.6
1154 2 125.7

1075 rows × 2 columns

In [49]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

new = ols('Fare ~ Payment_Type', data = data1).fit()
In [50]:
anovaRes = sm.stats.anova_lm(new,typ=2)
anovaRes['PR(>F)']
Out[50]:
Payment_Type    0.944015
Residual             NaN
Name: PR(>F), dtype: float64

Here, the F value represents the ratio of the between-group variance of fare (across the three payment types) to the within-group variance, and it indicates that the variation between the groups is low.

Also, the p-value is very high: 0.944.

0.944 > 0.05

So we accept the null hypothesis: the average revenue (fare per ride) is the same across all modes of payment.
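
As a cross-check, the same one-way ANOVA can be run with Payment_Type kept as a categorical factor via the C() operator in the model formula, which avoids the 1/2/3 recoding above and also shows the F statistic alongside the p-value. A minimal sketch, assuming df_new as defined above:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-way ANOVA with Payment_Type treated as a categorical factor
model_cat = ols('Fare ~ C(Payment_Type)', data=df_new[["Payment_Type", "Fare"]]).fit()
print(sm.stats.anova_lm(model_cat, typ=2))   # columns: sum_sq, df, F, PR(>F)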

Number of rides by mode of payment

In [51]:
x=sns.countplot(x=df_new['Payment_Type'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)
In [52]:
sns.set_palette("colorblind")
x=sns.countplot(x=df_new['Payment_Method'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

Revenue based on different types of payment

In [53]:
plt.figure(figsize=(10,7))
sns.set_palette("Paired")
plot=sns.stripplot(y=df_new['Payment_Method'],x=df_new['Fare'])   # use df_new for both axes so the series align
plot.set(xlabel='Fare',ylabel='Payment Method',title='Fare vs Payment Method')
plt.show()
In [54]:
data = df_new.groupby("Payment_Method")["Fare"].sum()
data
#grouping the fares by payment method
Out[54]:
Payment_Method
Card            100463.7
Cash             66279.9
Gift Card        47577.3
UPI/e-Wallet    101030.4
Uber Cash       115159.2
Name: Fare, dtype: float64
In [55]:
sns.set_palette("bright")
plt.figure(figsize=(10,8))
data.plot.pie(autopct="%.1f%%")
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5ab35df990>

It is found that Uber got more revenue from non-cash payments (84.6%) than from cash payments (15.4%). Uber Cash (the Uber wallet) is the top contributor among payment modes, followed by UPI/e-Wallet payments.


Conclusion:

From the above analysis, the total revenue from digital payments is higher than that from card and cash payments.

However, the number of digital payments is also proportionally higher than for the other modes of payment.

As a result, the average revenue per ride is similar across payment types, even though digital payments collected more revenue in total.
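
To back this up numerically, the total, average and count of fares can be compared per payment type in one step. A minimal sketch, assuming df_new as above:

# Revenue share and average fare per ride by payment type
totals = df_new.groupby("Payment_Type")["Fare"].agg(["sum", "mean", "count"])
totals["share_%"] = 100 * totals["sum"] / totals["sum"].sum()
print(totals.round(1))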



3. The purpose of the rides has an effect on the day on which the rides happened

In this test, we determine whether the weekday parameter has any effect on the purpose of usage (perhaps weekends see more Meal/Entertainment trips while weekdays see more official activities).

Chi-Square Hypothesis Testing:

For the comparison of two categorical variables, we perform a chi-square test of independence.

Null Hypothesis H0: There's no relation between Weekday and Purpose.

Alternate Hypothesis Ha: There's a relation between Weekday and Purpose.

Significance value (alpha): 0.05.

In [56]:
tab = pd.crosstab(df_new.Weekday, df_new.Purpose)   #creating a crosstab of 2 columns
tab
Out[56]:
Purpose Airport/Travel Between Offices Charity ($) Commute Customer Visit Errand/Supplies Meal/Entertain Meeting Moving Temporary Site
Weekday
Friday 5 2 0 0 16 21 45 51 0 15
Monday 8 5 0 0 15 17 49 32 0 8
Saturday 7 6 0 0 20 18 38 45 1 14
Sunday 3 5 0 0 24 27 46 46 0 17
Thursday 3 2 0 0 21 19 39 48 0 16
Tuesday 10 3 0 0 18 23 39 44 2 18
Wednesday 3 2 1 1 19 24 51 43 1 19
In [57]:
from scipy.stats import chi2_contingency
stat,p,dof,expected = chi2_contingency(tab)
print("stat=%.3f, p=%.3f" % (stat,p))
stat=47.538, p=0.720
In [58]:
if p < 0.05:
    print("Reject H0 and accept Ha")
else:
    print("Accept H0 and reject Ha")
Accept H0 and reject Ha

The test statistic (47.54) is not large relative to its degrees of freedom, and the p-value (0.720) is well above 0.05, so the test finds no significant relationship between the Weekday and Purpose parameters.
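
To quantify how weak the association is, an effect size such as Cramér's V can be computed from the same chi-square statistic. A minimal sketch, reusing stat and tab from the cells above:

import numpy as np

# Cramér's V: 0 means no association, 1 means a perfect association
n = tab.values.sum()
cramers_v = np.sqrt(stat / (n * (min(tab.shape) - 1)))
print("Cramér's V = %.3f" % cramers_v)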

Ride Category

In [59]:
sns.set_palette("Paired")
x=sns.countplot(x=df_new['Category'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

It is clearly visible that far more rides are made for business purposes than for personal ones.

Purpose of rides

In [60]:
plt.figure(figsize=(15,8))
x = sns.countplot(x=df_new['Purpose'])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

It is evident that users take Uber mainly for Meetings and Meal/Entertainment rather than for other activities such as Errands/Supplies, Customer Visits or Airport/Travel, and least of all for Charity, Commute and Moving.

4. Other data plots

I. Trips and Driver rating

In [61]:
freq_table = pd.crosstab(df_new['Driver_Rating'],'no_of_drivers')
freq_table
Out[61]:
col_0 no_of_drivers
Driver_Rating
0 171
1 180
2 169
3 192
4 185
5 178
In [62]:
plt.figure(figsize=(10,8))
sns.set_palette("rocket")
x=sns.countplot(x=df_new['Driver_Rating'])
plt.ylabel('Number of Drivers')
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='white', size=12)

From the above plot, the number of trips per driver-rating level lies between 169 and 192. For example, 180 trips carry a 1-star driver rating and 178 trips a 5-star rating, while 171 trips were not rated (rating 0). The highest count (192 trips) falls under the 3-star rating and the lowest (169 trips) under the 2-star rating.

II. Month vs Number of Trips

In [63]:
sns.set_palette("Paired")
plt.figure(figsize=(15,8))
x=sns.countplot(x=df_new['Month'])
x.set(xlabel="Month",ylabel="Frequency")
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

From the plot, January records the highest number of trips and June the lowest.

III. Day vs Number of Trips

In [64]:
plt.figure(figsize=(10,8))
order = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
x=sns.countplot(x=df_new["Weekday"], order=order)
x.set(xlabel="Weekday",ylabel="Frequency")
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

When aggregated by weekday, Sunday records the most trips, possibly because people spend their weekends travelling. Wednesday comes second, while Monday records the fewest trips.

IV. Time vs Frequency

In [65]:
sns.set_palette("pastel")
plt.figure(figsize=(15,8))
x=sns.countplot(x=df_new['SHour'])
x.set(xlabel="Hour",ylabel="Frequency")
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

The above plot largely shows the expected pattern:

  • From midnight to early morning, the number of requests is at its lowest.
  • Requests then gradually increase towards the late-morning hours (9 AM to 12 noon).
  • Likewise, after the evening peak around 17:00 the frequency gradually falls towards midnight, reaching its minimum.
  • Somewhat unexpectedly, the number of requests peaks between 13:00 and 15:00.

V. Number of trips completed

In [66]:
x=sns.countplot(x=df_new["Status"])
for p in x.patches:
    x.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='left', va='top', color='black', size=12)

Among the total trips, 932 were completed and 143 were cancelled, i.e. about 13.3% of all trips were cancelled.
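
The same figures can be reproduced directly from the Status column. A minimal sketch, assuming df_new as above:

# Counts and proportions of completed vs cancelled trips
print(df_new["Status"].value_counts())
print(df_new["Status"].value_counts(normalize=True).round(3))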

VI. Revenue based on purpose of trips

In [67]:
purpose = df_new.groupby("Purpose")["Fare"].sum()
purpose
Out[67]:
Purpose
Airport/Travel      17079.3
Between Offices      8366.4
Charity ($)           528.3
Commute              5976.6
Customer Visit      80691.9
Errand/Supplies     30021.9
Meal/Entertain     106197.0
Meeting            139326.3
Moving                994.5
Temporary Site      41328.3
Name: Fare, dtype: float64
In [99]:
plt.figure(figsize=(17,8))
x=sns.barplot(x="Purpose", y="Fare", data=df_new,ci = None, estimator=sum)

Based on purpose, Uber collected the most revenue from Meeting (about 139,300), Meal/Entertain (about 106,200) and Customer Visit (about 80,700) trips.

VII. PAIRPLOTS

In [69]:
sns.pairplot(df_new)
Out[69]:
<seaborn.axisgrid.PairGrid at 0x7f5ab36c2dd0>

VIII. CORRELATION MATRIX

In [70]:
correlation=df_new.corr()
In [71]:
plt.figure(figsize=(15,15))
sns.heatmap(correlation,cmap="coolwarm",annot=True,vmax=.5,vmin=-.5,center=0,square=True,linewidths=.5)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5ab2ef8b10>

From the above correlation heatmap,

Highly Correlated:

  1. Positive correlation:

    • Km vs Duration_min
    • Duration_min vs Fare
    • KM vs Fare
  2. Negative correlation:

    • Fare_per_km vs Tip

Moderately correlated

  1. Positive correlation:

    • KM vs Tip
    • Duration_min vs Tip
    • Fare vs Tip
  2. Negative correlation:

    • KM vs Fare_per_km
    • Fare vs Fare_per_km
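
The strongest pairs can also be pulled out of the correlation matrix programmatically instead of reading them off the heatmap. A minimal sketch, reusing the correlation variable computed above:

# Rank parameter pairs by absolute correlation (dropping the diagonal and duplicate pairs)
pairs = correlation.unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(10))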


In [103]:
!sudo apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic
Reading package lists... Done
Building dependency tree       
Reading state information... Done
texlive-fonts-recommended is already the newest version (2017.20180305-1).
texlive-plain-generic is already the newest version (2017.20180305-2).
texlive-xetex is already the newest version (2017.20180305-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
In [106]:
!jupyter nbconvert --to html /content/EDA_Uber.ipynb
[NbConvertApp] Converting notebook /content/EDA_Uber.ipynb to html
[NbConvertApp] Writing 1637947 bytes to /content/EDA_Uber.html