Wearable technology is broadly used over the past years and the most hype one is smartwatch. It allows people to do anything from their wrist, such as using it as submission of their mobile phone to monitoring their fitness.
If you are an iPhone and Apple Watch user, you may be familiar with:
- Health app that tracks, gathers, integrates, and summarizes our health information.
- Fitness app where we could take an in-depth and valuable insights of our workout.
I have been wearing Apple Watch 4 since 2019. Per September 2022, I am a gym member, exercising almost every day, and falling in love with running. Although Health app offers great visualizations of the recorded fitness data, I found there are some physical indicators I would like to elaborate myself and compare at a glance in order to see the progress of my own body, specifically since the moment when I has started my serious commitment with exercise and lost about fifteen kilograms of body weight.
Previously, I wrote about how to do running analysis from Garmin by using Python. For this time, we will gather our data from Apple Health which undoubtedly have a distinct way to extract then fix the dataset to make it presentable in visualization and analysis processes.
Retrieving Apple Health Data
The easiest way to get it is:
- Go to “Health” app on our iPhone
- Click on our profile icon on the top right corner
- Select “Export All Health Data” on the bottom. This process might take some time.
- Transfer it to your PC/laptop by email, AirDrop, or any other method we prefer.
After that, we have to unzip the file to see inside that there are:
- Folders:
- electrocardiograms
- workout-routes → workout location data
- Files:
- export_cda.xml
- export.xml → contains our concerned data stored in various tags: record, WorkoutActivity, WorkoutEvent, WorkoutStatistics, WorkoutRoute, ActivitySummary, ClinicalRecord, Audiogram, and much more data that I just realized Apple offers astounding features to keep track of our health information if we could exploit them properly.
Record and Workout attribute list interest me the most. Record exhibits the value of standard physical quantification while Workout stores the detail workout information we have been done since we wear Apple Watch. Thus, we will focus on how to pick and handle the data inside these attributes.
Requirements
- export.xml file as we have downloaded above
- Jupyter notebook (.ipynb) file
- Python 3.9.13
- The libraries needed:
- Pandas 1.4.3
- Plotly 5.4.1
- IPython 7.31.1
Import Libraries
First, we will import the packages in our .Jupyter notebook (.ipynb):
import xml.etree.ElementTree as ET import pandas as pd import datetime as dt import plotly.express as px import plotly.graph_objects as go from plotly.subplots import make_subplots from IPython.display import display
Import and Load the XML file
Our health data is stored in XML (eXtensible Markup Language) format, a file that consist of plain text and tags. Each tags correspond to type of stored data. Prior to further exploration, we have to parse the file using ElementTree
.
# create element tree object tree = ET.parse('apple_health_export/export.xml')
Data Collection in <Record>
We will stored the data of <Record> in a list of dictionaries named record_list
.
# extract the attributes of health record root = tree.getroot() record_list = [x.attrib for x in root.iter('Record')] record_list
Preview of the record_list
:
Then, we create the DataFrame of record_list
called record_data
.
# create a DataFrame from record_list record_data = pd.DataFrame(record_list) # print the information of record_data print(record_data.info()) # show the record_data DataFrame display(record_data)
Output:
As we can see above, we have 1,648,287 information and various types of data in type column. To identify what kind of data in type column, we can run:
# unique elements in 'type' column of record_data record_data.type.unique()
Output:
Number of various physical measurements presented by Apple Watch is surprising me and exceeding my expectation. Yet, I will not inspect all of them that I believe I didn’t have much input and not really into them, such as SwimmingStrokeCount since I am not a regular swimmer, so we will clean them out later in Data Cleaning.
To inspect how Apple recorded each of those data, in this case for example Active Energy Burned, we can check it by running:
# return recorded Active Energy Burned record_data.loc[(record_data['type'].str.contains("ActiveEnergyBurned"))]
Output:
It exhibits that Apple recorded them in sparse and we need to aggregate into more practical format like in daily or monthly, depend on how we would like to present them in visualizations. Some crucial health metrics that needed to be accumulated in certain time frame are basal energy burned, active energy burned, distance walking-running, step count, and apple stand time.
Note: every measurement has their own unique log so further observation and strategy are needed to make them analyzable.
Data Cleaning in <Record>
Since our data are still in incorrect format, disorganize order, represented in a set of inaccurate number, and store irrelevant information, we will fix them in this process.
- In type column, we cut the name short by just keeping the part of identifiers. (e.g. HKQuantityTypeIdentifierHeight is renamed into Height)
- We have three types of date information: startDate, endDate, and creationDate. creationDate is the day when we export our file from Apple Health so I do not think this information matters to our further analysis. In case of endDate and startDate, my assumption is endDate only have hour differences with startDate so keeping only one of them will be sufficient. Therefore, I decided to take down creationDate and endDate.
- Modify startDate into date format and add Day and Month column to state the day and the month information.
All of those modifications are stored in record_data_cleaned
DataFrame.
# remove 'sourceName', 'sourceVersion', 'device', 'creationDate', 'endDate' columns record_data_cleaned = record_data.drop(['sourceName','sourceVersion', 'device', 'creationDate','endDate'], axis=1) # transform 'startDate' into date format # record_data['startDate'] = pd.to_datetime(record_data['startDate']).dt.strftime('%Y-%m-%d') record_data_cleaned['Day'] = pd.to_datetime(record_data['startDate']).dt.strftime('%A') record_data_cleaned['Date'] = pd.to_datetime(record_data['startDate']).dt.strftime('%Y-%m-%d') record_data_cleaned['Month'] = pd.to_datetime(record_data['startDate']).dt.strftime('%B') # value is numeric, NaN if fails record_data_cleaned['value'] = pd.to_numeric(record_data['value'], errors='coerce') # shorter observation names record_data_cleaned['type'] = record_data_cleaned['type'].str.replace('HKQuantityTypeIdentifier', '') record_data_cleaned['type'] = record_data_cleaned['type'].str.replace('HKCategoryTypeIdentifier', '') # reorder 'record_data' columns record_data_cleaned = record_data_cleaned[['type', 'Date','Day', 'Month','value','unit']]
As I stated before, I am not interested to explore every details from Apple Health, such as environmental audio exposure, walking asymmetry percentage, appetite changes, breast pain, and many more. Thus, I filtered record_data_cleaned
in a dictionary, record_data_df_dict
, by only selecting data type that I would like to explore: body mass, active energy burned, basal energy burned, distance walking-running, step count, stand time, walking speed, running speed, heart rate variability, resting heart rate, walking heart rate average, VO2 max, and heart rate recovery one minute.
# dictionary of DataFrames for filtered 'record_data' record_data_df_dict = {} # filter 'type' of 'record_data' record_types = [ 'BodyMass', 'ActiveEnergyBurned', 'BasalEnergyBurned', 'DistanceWalkingRunning', 'StepCount', 'AppleStandTime', 'WalkingSpeed', 'RunningSpeed', 'HeartRateVariabilitySDNN', 'RestingHeartRate', 'WalkingHeartRateAverage', 'VO2Max', 'HeartRateRecoveryOneMinute' ] # create new DataFrame for every interested data for record_type in record_types: record_data_df_dict[record_type] = record_data_cleaned.loc[(record_data_cleaned['type'].str.contains(record_type))].rename(columns={"value":record_type}).sort_values(by='Date')
Some observations should be accumulated daily and store it in record_data_df_dict_daily
dictionary by running this code:
# list of data 'type' that need to be summed daily key_get_sum = [ 'BasalEnergyBurned', 'ActiveEnergyBurned', 'DistanceWalkingRunning', 'StepCount', 'AppleStandTime' ] record_data_df_dict_daily = {} for key in key_get_sum: record_data_df_dict_daily[key] = record_data_df_dict[key].groupby(record_data_df_dict[key]['Date']).agg({key: 'sum', 'Day': lambda x: x.mode().iat[0]}).reset_index()
I am also captivated to observe some data in monthly time frame. Thus, I keep them in monthly dictionary, record_data_df_dict_monthly
.
record_data_df_dict_monthly = {} for key in key_get_sum: record_data_df_dict_monthly[key] = record_data_df_dict[key].groupby(record_data_df_dict[key]['Date'].str[:-3]).agg({key: 'sum', 'Month': lambda x: x.mode().iat[0]}).reset_index()
I have been wearing Apple Watch since approximately 2020. But I only started my workout routine since September 2022. So, I would like to understand the transformation before and after exercise.
# Before vs After Workout # Workout routine starts from 1 September 2022 -> data filter after workout are specified starts on this date # Body mass progress before and after Workout record_data_df_BodyMass_start_Sep22 = record_data_df_dict["BodyMass"].loc[(record_data_df_dict["BodyMass"]['Date'] >= '2022-09-01')] #Active Energy Burned before and after workout routine record_data_df_ActiveEnergyBurned_before_workout = record_data_df_dict_monthly["ActiveEnergyBurned"].loc[(record_data_df_dict_monthly["ActiveEnergyBurned"]['Date'] < '2022-08-31')] record_data_df_ActiveEnergyBurned_after_workout = record_data_df_dict_monthly["ActiveEnergyBurned"].loc[(record_data_df_dict_monthly["ActiveEnergyBurned"]['Date'] >= '2022-08-31')] # Basal Energy Burned before and after workout routine record_data_df_BasalEnergyBurned_before_workout = record_data_df_dict_monthly["BasalEnergyBurned"].loc[(record_data_df_dict_monthly["BasalEnergyBurned"]['Date'] < '2022-08-31')] record_data_df_BasalEnergyBurned_after_workout = record_data_df_dict_monthly["BasalEnergyBurned"].loc[(record_data_df_dict_monthly["BasalEnergyBurned"]['Date'] >= '2022-08-31')] # Distance Walking-Running before and after workout routine record_data_df_Distance_before_workout = record_data_df_dict_monthly["DistanceWalkingRunning"].loc[(record_data_df_dict_monthly["DistanceWalkingRunning"]['Date'] < '2022-08-31')] record_data_df_Distance_after_workout = record_data_df_dict_monthly["DistanceWalkingRunning"].loc[(record_data_df_dict_monthly["DistanceWalkingRunning"]['Date'] >= '2022-08-31')] # Step count before and after workout routine record_data_df_StepCount_before_workout = record_data_df_dict_monthly["StepCount"].loc[(record_data_df_dict_monthly["StepCount"]['Date'] < '2022-08-31')] record_data_df_StepCount_after_workout = record_data_df_dict_monthly["StepCount"].loc[(record_data_df_dict_monthly["StepCount"]['Date'] >= '2022-08-31')]
Data Collection in <Workout>
Now we will attain our workout data and save it in a dictionary called final_workout_dict
and turned it into a DataFrame, final_workout_df
.
final_workout_dict = [] workout_list = list(root.iter('Workout')) for i in range(len(workout_list)): workout_dict = workout_list[i].attrib WorkoutStatisticsList = list(workout_list[i].iter("WorkoutStatistics")) for i, WorkoutStatistics in enumerate(WorkoutStatisticsList): if "ActiveEnergyBurned" in WorkoutStatistics.attrib['type']: workout_dict['activeEnergyBurned'] = WorkoutStatistics.attrib['sum'] if "BasalEnergyBurned" in WorkoutStatistics.attrib['type']: workout_dict['basalEnergyBurned'] = WorkoutStatistics.attrib['sum'] final_workout_dict.append(workout_dict) final_workout_df = pd.DataFrame(final_workout_dict) #create final_workout_df dataframe final_workout_df.head
Sneak peak of the final_workout_df
:
Data Cleaning in <Workout>
Just like in record data, final_workout_df is also needed to be managed and organized by running:
#drop 'creationDate' and 'endDate' column final_workout_df_cleaned = final_workout_df.drop(['sourceName','sourceVersion', 'device', 'creationDate','endDate'], axis=1) # transform creationDate into date format final_workout_df_cleaned['Date'] = pd.to_datetime(final_workout_df['startDate']).dt.strftime('%Y-%m-%d') final_workout_df_cleaned['Day'] = pd.to_datetime(final_workout_df['startDate']).dt.strftime('%A') # rename Activity Type final_workout_df_cleaned['workoutActivityType'] = final_workout_df['workoutActivityType'].str.replace('HKWorkoutActivityType','') # reorder column final_workout_df_cleaned = final_workout_df_cleaned[['Day', 'Date', 'workoutActivityType', 'duration', 'durationUnit', 'activeEnergyBurned', 'basalEnergyBurned']] # transform data type of 'duration' from object into float final_workout_df_cleaned['duration'] = final_workout_df['duration'].astype(float) # transform data type of 'activeEnergyBurned' and 'basalEnergyBurned' from object into float final_workout_df_cleaned['activeEnergyBurned'] = final_workout_df['activeEnergyBurned'].astype(float) final_workout_df_cleaned['basalEnergyBurned'] = final_workout_df['basalEnergyBurned'].astype(float) display(final_workout_df_cleaned)
Output:
Then, I keep my workout data when I regularly workout since September 2022 in separate DataFrame: workout_routine_df
.
# filter workout data starts from when the regular workout was started (September 2022) workout_routine_df = final_workout_df_cleaned.loc[(final_workout_df_cleaned['Date'] >= '2022-09-01')] display(workout_routine_df)
Output:
Summary
To sum up, we have some DataFrames:
record_data
→ all recorded data as an element that has <Record> tag in xml file.record_data_cleaned
→ record_data that has information of type, date, day, value, and unit.record_data_df_dict
→ record_data_cleaned that has been:- sorted by date
- the value column has been renamed to its data name
- filtered only to the interested data.
record_data_df_dict_daily
→ value of record_data_df_dict that has been accumulated daily.record_data_df_dict_monthly
→ aggregated record_data_df_dict every monthfinal_workout_df
→ workout DataFramefinal_workout_df_cleaned
→ cleaned workout DataFrameworkout_routine_df
→ filtered final_workout_df starts from the beginning of workout routine.
On the next post, those DataFrames are our keys to visualize and get insights on our wellness retrospectively so we could understand on how should we maintain or improve our well-being.