Skip to content

TimiLin01/BellaBeat_Case_Study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 

Repository files navigation

BellaBeat Case Study

Google Data Analytics Capstone Project

bellabeat

Contents

Company Background

BellaBeat was founded by Urška Sršen and Sandro Mur in 2013. It is a high-tech company that manufactures health-focused smart products and collects data on the activity, sleeps, stress and reproductive health of its customers in order to provide information regarding their health and habits.

Business Task

Identifying the trends in smart device usage and providing recommendations that can help stakeholders to make data-driven decisions regarding the marketing strategy and discover new potentials for the company.

Key Stakeholders

  • Urška Sršen: BellaBeat Co-founder and Chief Creative Officer (CCO)
  • Sandro Mur: Mathematician, BellaBeat Co-founder and Chief Executive Officer (CEO)
  • BellaBeat Marketing Team

About the Data

The dataset[1] I used is FitBit Fitness Tracker Data, which is available on Kaggle and provided by MÖBIUS.
It is collected from eligible Fitbit users who consented to submit personal tracker data via a survey distributed by Amazon Mechanical Turk between 03.12.2016 - 05.12.2016.
This dataset has overall 18 CSV files, including information about the minute-level output for physical activity, heart rate, weight and sleep monitoring between 04.12.2016 - 05.12.2016 (31 days period).

Data Storage and Organization

After downloading the dataset, I opened all the CSV files in the spreadsheet (EXCEL) to see how are the data structured.
The dataset includes wide and narrow data. Each user has a unique ID so that we can distinguish between them. I noticed that 'dailyCalories_merged.csv', 'dailyIntensities_merged.csv' and 'dailySteps_merged.csv' are all included in 'dailyActivity_merged.csv', so I decided not to use these three files.
I also discovered that some tables have too many rows. Hence using BigQuery to analyse the data will be a better choice than the spreadsheet.
So I stored the below files in BigQuery under the project BellaBeat Case Study (ID: bellabeat-timilin), dataset Fit-Data:

Table Name CSV File Name Discription
daily_activity dailyActivity_merged.csv Daily activity over 31 days period of 33 users. Tracking the daily steps, distance, intensities, and calories.
heartrate heartrate_seconds_merged.csv 14 users' heartrate during the day.
hourly_cal hourlyCalories_merged.csv Hourly calories burnt over 31 days period of 33 users.
hourly_intensities hourlyIntensities_merged.csv Hourly total intensity over 31 days of 33 users.
hourly_step hourlySteps_merged.csv Hourly steps over 31 days of 33 users.
sleep_day sleepDay_merged.csv 24 users' sleeping info over 31 days, including total sleep records, total minutes asleep and total time in bed.
weight_log weightLogInfo_merged.csv 8 users' weight in Kg and Pounds, and their BMI, Fat, Manual Entry or not recorded in a day, over 31 days.

Credibility and Integrity

To determine the credibility and integrity of the data, I will use the 'ROCCC' system:

  • Reliability: The data is not reliable since we are not sure what the marginal error is, and a small sample size (~30 participants) has been used. So our analysis might not be true for the whole population.
  • Originality: The data is not original since it was collected by Amazon Mechanical Turk and made available through MÖBIUS but has been checked against [1].
  • Comprehensiveness: The data is not comprehensive because we have no information regarding whether our samples are randomly collected or not. This might lead to bias.
  • Current: The data is not current since it was collected in 2016. Therefore our analysis cannot represent the current trend in smart device usage.
  • Cited: It is cited.
    As a result, our data do not satisfy the 'ROCCC' system. This means that using this dataset, we are unable to provide reliable and comprehensive recommendations for BellaBeats. Hence, our analysis can only act as directions which should be verified through a more reliable dataset.

Uploading and Transforming Data

When I tried to upload the above CSV files into BigQuery, only the schema for 'dailyIntensities_merged.csv' could be detected automatically. For other files, the below error occurs:

Failed to create table: Error while reading data, error message: Could not parse '4/12/2016 12:00:00 AM' as TIMESTAMP for field ActivityHour (position 1) starting at location 26 with message 'Invalid time zone: AM'

Therefore, I had to manually define the schema for them and let ActivityHour temporarily be STRING instead of TIMESTAMP.
Now, having the files uploaded. I checked the datatypes for each table. Since we temporarily set the ActivityHour/Time/SleepDay as STING for some tables, we can now set them as DATETIME using the following code:

-- eg. For heartrate table
CREATE OR REPLACE TABLE `bellabeat-timilin.Fit_Data.heartrate`
AS
SELECT 
      Id,
      PARSE_DATETIME("%m/%d/%Y %l:%M:%S %p",Time) AS Time,
      Value
FROM 
     `bellabeat-timilin.Fit_Data.heartrate` 

And similarly we can do this for hourly_cal, hourly_intensities, hourly_step, sleep_day and weight_log tables.
(See code in the Appendix)

Data Cleaning

First, we check whether there is any NULL Value in our dataset:

-- eg. For daily_activity table
SELECT
      *
FROM
      `bellabeat-timilin.Fit_Data.daily_activity`
WHERE
      Id IS NULL OR
      ActivityDate IS NULL OR
      TotalSteps IS NULL OR
      TotalDistance IS NULL OR
      TrackerDistance	IS NULL OR
      LoggedActivitiesDistance IS NULL OR
      VeryActiveDistance	IS NULL OR
      ModeratelyActiveDistance	IS NULL OR
      LightActiveDistance	IS NULL OR
      SedentaryActiveDistance	IS NULL OR
      VeryActiveMinutes	IS NULL OR
      FairlyActiveMinutes	IS NULL OR
      LightlyActiveMinutes	IS NULL OR	
      SedentaryMinutes	IS NULL OR
      Calories	IS NULL

We only found NULL Values in the weight_log table, under the Fat column. These NULL Values can be ignored if we do not want to use information regarding users' fat.
Next, we search for any duplicates:

-- eg. For sleep_day table
SELECT 
      Id,
      Time,
      TotalSleepRecords,
      TotalMinutesAsleep,
      TotalTimeInBed,
      COUNT(*) AS No_Of_Dup
FROM
     `bellabeat-timilin.Fit_Data.sleep_day`
GROUP BY
     Id,
     Time,
     TotalSleepRecords,
     TotalMinutesAsleep,
     TotalTimeInBed
HAVING No_Of_Dup > 1

And we can actually find duplicates in the sleep_day table:

Id Time TotalSleepRecords TotalMinutesAsleep TotalTimeInBed No_Of_Dup
4388161847 2016-05-05T00:00:00 1 471 495 2
8378563200 2016-04-25T00:00:00 1 388 402 2
4702921684 2016-05-07T00:00:00 1 520 543 2

Hence, we need to remove three rows in total from this table. To do this, we can go back temporarily to EXCEL, and by filtering for these rows, we can easily delete the duplicates. I chose to do this in EXCEL because we only have three rows to delete, but if we have more, it is better to use R.
This gives a new table:

Table Name CSV File Name Discription
sleep_day_new sleepDay_merged_cleaned.csv Cleaned version of sleep_day

(See codes in the Appendix)

Checking the # of users

Before we start analysing our data, it is always good to ensure how many unique IDs are in different tables because not all participants answer all the questions.

-- eg. For daily_activity table
SELECT
    COUNT(DISTINCT Id) AS No_of_users
FROM `bellabeat-timilin.Fit_Data.daily_activity`
Table Name daily_activity heartrate hourly_cal hourly_intensities hourly_step sleep_day_new weight_log
No_of_users 33 14 33 33 33 24 8

(See code in the Appendix)

Average Steps VS Average Active Minutes

The Centers for Disease Control and Prevention (CDC) suggest that adults should aim for 10000 steps per day [2]. But recently, there have been researches [3] which also said that 7000 steps are already enough. Therefore, we can first investigate how many users walk more than 7000 steps a day on average.

SELECT 
      DISTINCT Id,
      ROUND(AVG(TotalSteps),3) AS daily_avg_step
FROM
      `bellabeat-timilin.Fit_Data.daily_activity`
GROUP BY
      Id
HAVING daily_avg_step >= 7000

This gives us back 20 users out of 33, which is 60.6% of our sample population.
However, does having more steps means that you were more active and did more sport than others who completed fewer steps? Not necessarily, since step numbers can be faked. Therefore, we should look at the active minutes.
An adult should aim for at least 30 "active minutes" per day [4]. But how can we link this with our dataset? Actually, Fitbit measures active minutes, but in our data, it is split into "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes" and "SedentaryMinutes". And the "active minutes" we want to measure is indeed "VeryActiveMinutes" + "FairlyActiveMinutes" = "Active Minutes" [5].
Therefore, to gain the average active minutes:

SELECT 
      DISTINCT Id,
      ROUND(AVG(TotalSteps),3) AS daily_avg_step,
      ROUND(AVG(VeryActiveMinutes+FairlyActiveMinutes),3) AS daily_avg_minute
FROM 
      `bellabeat-timilin.Fit_Data.daily_activity`
GROUP BY
      Id
HAVING 
      daily_avg_step >= 7000 OR
      daily_avg_minute >= 30

And this time we get back 21 users out of 33, which is 63.6%. Moreover, we can also find users whose average step and average active minutes differ a lot, which might lead us to different conclusions.
For example:

Id Average Daily Steps Average Daily Active Minutes
6117666160 7046.714 3.607

During the week

After checking the daily averages, we can also investigate which days during the week users are more active or complete their daily step goal.

Weekday Average Total Step Average Active Minutes No_of_Users
Sunday 6933.231 34.512 121
Monday 7780.867 37.108 120
Tuesday 8125.007 37.289 152
Wednesday 7559.373 33.88 150
Thursday 7405.837 31.367 147
Friday 7448.23 32.167 126
Saturday 8152.976 37.121 124

(See code in the Appendix)
We can see that except for Sunday, users complete their daily step goals on average. But if we look at the average active minutes, the minimum 30 active minutes is achieved throughout the week. One possibility for this situation might be that people tend to stay at home on Sunday, but they still do some exercise to keep fit.
Overall, users on average walk the most on Tuesday and Saturday, and their average active minutes are the highest on these two days as well.

During the day

We can go even deeper by checking on which hours during the day people are the most active.

SELECT
      DISTINCT Hours,
      ROUND(AVG(StepTotal),3) AS hourly_avg_step
FROM
(SELECT
       Hours,
       StepTotal
 FROM(
      SELECT
            EXTRACT(HOUR FROM Time) AS Hours,
            StepTotal
      FROM
          `bellabeat-timilin.Fit_Data.hourly_step`
     )
)
GROUP BY
        Hours
ORDER BY
        Hours

We found that the hourly_avg_step increases a lot around 8 AM and reaches its peak of 599.17 at 6 PM.

Average Calorie burnt

The U.S. Department of Health and Human Services [6] suggest that the average adult woman burns roughly 1,600 - 2,400 calories per day, and the average adult man uses 2,000 - 3,000 calories per day. However, to determine the exact numbers, we will need people's information regarding height, weight and gender. Unfortunately, we only have the weight and BMI information. Therefore, we will only examine the average calories burnt for each user during the 31 days and when users burn the most calories.

Per user

SELECT
      DISTINCT Id,
      ROUND(AVG(Calories),3) AS avg_cal,
      ROUND(AVG(VeryActiveMinutes + FairlyActiveMinutes),3) AS avg_active_minute
FROM
      `bellabeat-timilin.Fit_Data.daily_activity`
GROUP BY
      Id

We discovered that each user's average calorie expenses range from 1483.355 to 3436.581 Cal. And what we can definitely say is that there are four users who on average did not burn enough calories because their average calorie expenses are less than 1600 Cal.
I also retrieved the average active minutes for each user, but at the moment we cannot see any obvious relationship between active minutes and calorie expense. We can keep these data and visualise them later so that we might see some correlation.

During the day

AM 12 1 2 3 4 5 6 7 8 9 10 11
Average Calories 71.805 70.165 69.186 67.538 68.262 81.708 86.997 94.478 103.337 106.143 110.461 109.807
PM 12 1 2 3 4 5 6 7 8 9 10 11
Average Calories 117.197 115.309 115.733 106.637 113.327 122.753 123.492 121.485 102.358 96.056 88.265 77.594

(See code in the Appendix)
We found out that on average the highest calories of 123.492 Cal were burnt around 6 PM and the lowest of 67.538 Cal around 3 AM.

Heart Rate

The normal resting heart rate for adults is between 60 - 100 beats per minute (bpm) [7]. But when people are doing physical activities their heart rate usually exceeds 100 bpm. The maximum heart rate a person can reach is 220 - his/her age [8], however, we do not have any age-related data, so let's see their average, minimum and maximum heart rates first.

SELECT
      Id,
      ROUND(AVG(Value),3) AS avg_heartrate,
      MIN(Value) AS Min_heartrate,
      MAX(Value) AS Max_heartrate
FROM
      `bellabeat-timilin.Fit_Data.heartrate`
GROUP BY
      Id

All 14 users have an average heart rate between 60 and 100 bpm. We can also discover that most of the users have a minimum heart rate of around or even under 40 bpm, this is not a big concern for healthy young adults and trained athletes, since they commonly have 40 - 60 bpm during sleep and rest. But for a general adult heart rate under 60 bpm is qualified as bradycardia.
There are many records of a maximum heart rate exceeding 100 bpm, which can be qualified as tachycardia but can also be acceptable if the user was doing physical exercise. However, some records can be dangerous even if the user was doing exercise. For instance, user 2022484408 reached 203 bpm, which can be the estimated maximum age-related heart rate for a 17-year-old (220 - 203 = 17) user, but is still dangerous even if he/she was doing vigorous-intensity physical activity.
By assuming users sleep between 10 PM and 6 AM, we can look for their average heart rate during their sleep. Luckily, most of them have an average between 60 and 100 bpm, only two users had around 54 and 58 bpm.
(See code in the Appendix)

Sleep

According to the Centers for Disease Control and Prevention (CDC), the recommended hours of sleep per day changes as the user age. In general, an adult needs 7 - 9 hours of sleep a day. So let's find out the average sleeping hours for each user.

SELECT
      Id,
      ROUND((total_sleep/no_of_sleep)/60,3) AS avg_daily_sleephour
FROM
(SELECT
      DISTINCT Id,
      SUM(TotalMinutesAsleep) AS total_sleep,
      COUNT(TotalMinutesAsleep) AS no_of_sleep
FROM
      `bellabeat-timilin.Fit_Data.sleep_day_new`
GROUP BY
      Id
)

There are only 10 out of 24 users (41.6%) sleep 7 - 9 hours on average, 1 out of 24 sleeps over 9 hours (10.867h) and all of the rest of the users do not have enough sleep. We are not sure whether those users really do not have enough sleep or they only recorded their nap and not their evening sleep.

Weight, BMI and Height

As we said before, we only have the weight and BMI information of only 8 users. We can use the below equation to calculate the height of the users:

bellabeat-bmi

So for each user, we get:

Id Average Height(cm) Average Weight(kg) Average BMI
4558609924 160.0 69.64 27.214
4319703577 162.5 72.35 27.415
1503960366 152.4 52.6 22.65
8877689391 182.8 85.146 25.487
2873212765 162.6 57.0 21.57
5577150313 180.0 90.7 28.0
1927972279 167.6 133.5 47.54
6962181067 160.1 61.553 24.028

(See code in the Appendix)

Finally, we will examine user engagement in order to know the smart device usage.
We will categorise their engagement as below:

  • Very Active User: Have at least 25 activity records and total usage of 500 hours (20h/day).
  • Moderately Active User: Have at least 16 daily activity records and 320 hrs usage.
  • Lightly Active User: Have at least 7 daily activity records and 140 hrs usage.
  • Not Active User: Have at least 1 activity record and 20 hrs usage.
Id 4057192912 2347167796 8253242879 3372868164 6775888955 7007744171 6117666160 8792009665 6290855005 1644430081 3977333714 5577150313 1624580081 2022484408 4319703577 4388161847 4702921684 6962181067 7086361926 8583815059 1844505072 1927972279 2026352035 2320127002 2873212765 4020332650 4445114986 4558609924 5553957443 8053475328 8378563200 8877689391 1503960366
# of Active Records 4 18 19 20 26 26 28 29 29 30 30 30 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
Total Usage Hours 88.167 292.117 455.717 472.9 591.667 599.467 507.85 559.35 689.733 685.633 481.233 509.767 736.617 736.6 506.583 573.267 534.783 490.55 548.817 742.65 683.8 701.683 488.983 734.817 736.467 684.45 541.133 724.717 470.667 720.083 486.267 735.517 581.75
User Type Not Active User Lightly Active User Moderately Active User Moderately Active User Very Active User Very Active User Very Active User Very Active User Very Active User Very Active User Moderately Active User Very Active User Very Active User Very Active User Very Active User Very Active User Very Active User Moderately Active User Very Active User Very Active User Very Active User Very Active User Moderately Active User Very Active User Very Active User Very Active User Very Active User Very Active User Moderately Active User Very Active User Moderately Active User Very Active User Very Active User

(See code in the Appendix)
Overall we have 1 Not Active, 1 Lightly Active, 7 Moderately Active and 24 Very Active Users.

(All the graphs can be found on my Tableau under the BellaBeat Case Study viz)

Steps

1. The graph below shows the average steps during the day. We can see that around 3 AM, there are almost 0 steps, which is reasonable since people are sleeping. From 5 AM, the step number increases dramatically, people wake up and go to work. The average step number floats between 400 and 600 from 8 AM to 7 PM.

2. We now can see the average step walked by the users during the 31 days period. And we can notice that 60.6% of the users complete the daily 7000-step goal. Steps_Goal


3. Taking the average by weekdays, we can discover that except for Sunday, the daily 7000-step goal is completed on the other days.

1. Step number is not the only criterion to determine whether a user has done enough sport or not a day. We can also check their active minutes. An adult should have 30 active minutes daily, and we can see that only 51.5% of the users completed this goal. ActMin_Goal

2. Based on the above standard, the daily goal is completed on all the weekdays. And we notice that people are especially active on Monday, Tuesday and Saturday. This is also suggested from the Average Steps During the Week graph.

3. If we combine our 7000-step and 30-minute goals, we can see 63.6% of users completed at least one or both goals.

Calories Burnt

1. Most of the calories were burnt between 9 AM and 7 PM. Calories_day

2. We want to find the relationship between calories burnt and step number, so we tried to find the best fit trendline. We know that a trendline is most reliable when its R-squared value is at or near 1. Therefore we found out that the polynomial trendline fits our data the most. The graphs below both have a cubic trendline. However, if we try to increase the degree of our model, then we will get a closer R-squared value to 1. Since we only have around 30 data points, I chose not to visualise a higher degree model to avoid bias.
If we group our data by user Id, then the R-squared of the model is only around 0.213, but if we group by date, then R-squared is 0.967. So might use this cubic model to estimate the daily average calories burnt given the average step, or vice versa. Calories_Step

3. Similar to above. Calories_ActMin

Key Findings

1. I have divided the users into four groups as defined above, and overall we have 72.73% very active users. They use a smart device almost on a daily basis for at least 20 hours during the period.

2. Based on our earlier findings, an adult needs at least 7 hours of sleep every day, so let's set this as our sleep goal. The below graphs show how many goals are completed by the 33 users. Since our sleep-related data only includes 24 users, so we can see a 27.27% of Null on both graphs. These are the users who did not record their sleeping records.
On the left, we can see that only one-third of the users had enough sleep, so apart from what we said earlier, that people might only record their naps, this might also indicate that users tend to take off their smart devices while sleeping.
On the right, we have 18.18% of users who live very healthily and completed all the 7000-step, 30-active minute and 7-hour sleep goals. And 15.15% of users did not complete any goals. This does not necessarily mean that these people are living unhealthy. They might just not be keen on using a smart device.

Based on the analysed data, to improve user engagement:

  • Introducing such smart devices that can be worn the whole day and do not need to be taken off. By emphasising the keywords like "can be worn all the day", "not harmful for the body" and "waterproof" can target those clients who are lazy about changing/wearing and taking off smart devices. Also, this helps BellaBeat to collect more data, like sleep-related ones and makes our analysis more accurate.
  • Add warning system into our smart device, which natofies our customers, when it detects too low/high heart rate. This can make more users relie on our smart device, also extend our targeted customer range. Especially those customers will be interested who are worried about their heart rate or having heart disease.
  • To motivate users to complete goals, we can introduce rewards when users complete a goal. Rewards, for instance, can be a collection of cute virtual badges. This can attract people who are passionate about collecting.
  • We can also add a sign-in form with rewards. Users can sign in when they have done enough sport.
  • To help users increase their active minutes, we can notify (like using vibrations) users when suitable, and users can only close these notifications when they have done enough sport. In order to make better recommendations, we need a more reliable analysis. Therefore we must continue to collect more data from more users.

[1] Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894
[2] Centers for Disease Control and Prevention (CDC). Stepping Up to Physcial Activity. Lifestyle Coach Facilitation Guide: Post-Core https://www.cdc.gov/diabetes/prevention/pdf/postcurriculum_session8.pdf
[3] Paluch, Amanda E., et al. "Steps per day and all-cause mortality in middle-aged adults in the coronary artery risk development in young adults study." JAMA network open 4.9 (2021): e2124516-e2124516. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2783711
[4] Bumgardner, W. "Why Your Fitbit Active Minutes Mean More Than Your Steps [2020-06-23]." https://www.verywellfit.com/why-active-minutes-mean-more-than-steps-4155747
[5] Semanik, P., Lee, J., Pellegrini, C. A., Song, J., Dunlop, D. D., & Chang, R. W. (2020). Comparison of physical activity measures derived from the Fitbit Flex and the ActiGraph GT3X+ in an employee population with chronic knee symptoms. ACR Open Rheumatology, 2(1), 48-52. https://onlinelibrary.wiley.com/doi/full/10.1002/acr2.11099
[6] U.S. Department of Health and Human Services. 2015–2020 Dietary Guidelines for Americans. https://health.gov/our-work/nutrition-physical-activity/dietary-guidelines/previous-dietary-guidelines/2015
[7] American Heart Association website. All About Heart Rate (Pulse). https://www.heart.org/en/health-topics/high-blood-pressure/the-facts-about-high-blood-pressure/all-about-heart-rate-pulse
[8] Centers for Disease Control and Prevention (CDC). Target Heart Rate and Estimated Maximum Heart Rate. https://www.cdc.gov/physicalactivity/basics/measuring/heartrate.htm
[9] Centers for Disease Control and Prevention (CDC). How Much Sleep Do I Need? https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html

For all the SQL codes, check the txt files.

About

BellaBeat Case Study - Google Data Analytics Capstone

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published