How to Extract and Clean Apple Health Data with Python?

May 25, 2023

Wearable technology is broadly used over the past years and the most hype one is smartwatch. It allows people to do anything from their wrist, such as using it as submission of their mobile phone to monitoring their fitness.

If you are an iPhone and Apple Watch user, you may be familiar with:

Health app that tracks, gathers, integrates, and summarizes our health information.
Fitness app where we could take an in-depth and valuable insights of our workout.

I have been wearing Apple Watch 4 since 2019. Per September 2022, I am a gym member, exercising almost every day, and falling in love with running. Although Health app offers great visualizations of the recorded fitness data, I found there are some physical indicators I would like to elaborate myself and compare at a glance in order to see the progress of my own body, specifically since the moment when I has started my serious commitment with exercise and lost about fifteen kilograms of body weight.

Previously, I wrote about how to do running analysis from Garmin by using Python. For this time, we will gather our data from Apple Health which undoubtedly have a distinct way to extract then fix the dataset to make it presentable in visualization and analysis processes.

Retrieving Apple Health Data

The easiest way to get it is:

Go to “Health” app on our iPhone
Click on our profile icon on the top right corner
Select “Export All Health Data” on the bottom. This process might take some time.
Transfer it to your PC/laptop by email, AirDrop, or any other method we prefer.

After that, we have to unzip the file to see inside that there are:

Folders:
- electrocardiograms
- workout-routes → workout location data
Files:
- export_cda.xml
- export.xml → contains our concerned data stored in various tags: record, WorkoutActivity, WorkoutEvent, WorkoutStatistics, WorkoutRoute, ActivitySummary, ClinicalRecord, Audiogram, and much more data that I just realized Apple offers astounding features to keep track of our health information if we could exploit them properly.

Preview of export.xml file

Record and Workout attribute list interest me the most. Record exhibits the value of standard physical quantification while Workout stores the detail workout information we have been done since we wear Apple Watch. Thus, we will focus on how to pick and handle the data inside these attributes.

Requirements

export.xml file as we have downloaded above
Jupyter notebook (.ipynb) file
Python 3.9.13
The libraries needed:
- Pandas 1.4.3
- Plotly 5.4.1
- IPython 7.31.1

Import Libraries

First, we will import the packages in our .Jupyter notebook (.ipynb):

import xml.etree.ElementTree as ET
import pandas as pd
import datetime as dt 
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display

Import and Load the XML file

Our health data is stored in XML (eXtensible Markup Language) format, a file that consist of plain text and tags. Each tags correspond to type of stored data. Prior to further exploration, we have to parse the file using ElementTree.

# create element tree object 
tree = ET.parse('apple_health_export/export.xml')

Data Collection in <Record>

We will stored the data of <Record> in a list of dictionaries named record_list.

# extract the attributes of health record
root = tree.getroot()
record_list = [x.attrib for x in root.iter('Record')]
record_list

Preview of the record_list:

Then, we create the DataFrame of record_list called record_data.

# create a DataFrame from record_list
record_data = pd.DataFrame(record_list)

# print the information of record_data
print(record_data.info())

# show the record_data DataFrame
display(record_data)

Output:

Information of record_data.info()

record_data DataFrame

As we can see above, we have 1,648,287 information and various types of data in type column. To identify what kind of data in type column, we can run:

# unique elements in 'type' column of record_data
record_data.type.unique()

Output:

Number of various physical measurements presented by Apple Watch is surprising me and exceeding my expectation. Yet, I will not inspect all of them that I believe I didn’t have much input and not really into them, such as SwimmingStrokeCount since I am not a regular swimmer, so we will clean them out later in Data Cleaning.

To inspect how Apple recorded each of those data, in this case for example Active Energy Burned, we can check it by running:

# return recorded Active Energy Burned
record_data.loc[(record_data['type'].str.contains("ActiveEnergyBurned"))]

Output:

It exhibits that Apple recorded them in sparse and we need to aggregate into more practical format like in daily or monthly, depend on how we would like to present them in visualizations. Some crucial health metrics that needed to be accumulated in certain time frame are basal energy burned, active energy burned, distance walking-running, step count, and apple stand time.

Note: every measurement has their own unique log so further observation and strategy are needed to make them analyzable.

Data Cleaning in <Record>

Since our data are still in incorrect format, disorganize order, represented in a set of inaccurate number, and store irrelevant information, we will fix them in this process.

In type column, we cut the name short by just keeping the part of identifiers. (e.g. HKQuantityTypeIdentifierHeight is renamed into Height)
We have three types of date information: startDate, endDate, and creationDate. creationDate is the day when we export our file from Apple Health so I do not think this information matters to our further analysis. In case of endDate and startDate, my assumption is endDate only have hour differences with startDate so keeping only one of them will be sufficient. Therefore, I decided to take down creationDate and endDate.
Modify startDate into date format and add Day and Month column to state the day and the month information.

All of those modifications are stored in record_data_cleaned DataFrame.

# remove 'sourceName', 'sourceVersion', 'device', 'creationDate', 'endDate' columns
record_data_cleaned = record_data.drop(['sourceName','sourceVersion', 'device', 'creationDate','endDate'], axis=1)

# transform 'startDate' into date format 
# record_data['startDate'] = pd.to_datetime(record_data['startDate']).dt.strftime('%Y-%m-%d')
record_data_cleaned['Day'] = pd.to_datetime(record_data['startDate']).dt.strftime('%A')
record_data_cleaned['Date'] = pd.to_datetime(record_data['startDate']).dt.strftime('%Y-%m-%d')
record_data_cleaned['Month'] = pd.to_datetime(record_data['startDate']).dt.strftime('%B')


# value is numeric, NaN if fails
record_data_cleaned['value'] = pd.to_numeric(record_data['value'], errors='coerce')

# shorter observation names
record_data_cleaned['type'] = record_data_cleaned['type'].str.replace('HKQuantityTypeIdentifier', '')
record_data_cleaned['type'] = record_data_cleaned['type'].str.replace('HKCategoryTypeIdentifier', '')

# reorder 'record_data' columns
record_data_cleaned = record_data_cleaned[['type', 'Date','Day', 'Month','value','unit']]

As I stated before, I am not interested to explore every details from Apple Health, such as environmental audio exposure, walking asymmetry percentage, appetite changes, breast pain, and many more. Thus, I filtered record_data_cleaned in a dictionary, record_data_df_dict, by only selecting data type that I would like to explore: body mass, active energy burned, basal energy burned, distance walking-running, step count, stand time, walking speed, running speed, heart rate variability, resting heart rate, walking heart rate average, VO2 max, and heart rate recovery one minute.

# dictionary of DataFrames for filtered 'record_data'
record_data_df_dict = {}

# filter 'type' of 'record_data'
record_types = [
   'BodyMass',
   'ActiveEnergyBurned',
   'BasalEnergyBurned',
   'DistanceWalkingRunning',
   'StepCount',
   'AppleStandTime',
   'WalkingSpeed',
   'RunningSpeed',
   'HeartRateVariabilitySDNN',
   'RestingHeartRate',
   'WalkingHeartRateAverage',
   'VO2Max',
   'HeartRateRecoveryOneMinute'
   ]


# create new DataFrame for every interested data
for record_type in record_types:
   record_data_df_dict[record_type] = record_data_cleaned.loc[(record_data_cleaned['type'].str.contains(record_type))].rename(columns={"value":record_type}).sort_values(by='Date')

Some observations should be accumulated daily and store it in record_data_df_dict_daily dictionary by running this code:

# list of data 'type' that need to be summed daily
key_get_sum = [
    'BasalEnergyBurned', 
    'ActiveEnergyBurned',
    'DistanceWalkingRunning',
    'StepCount',
    'AppleStandTime'
    ]

record_data_df_dict_daily = {}
for key in key_get_sum:
    record_data_df_dict_daily[key] = record_data_df_dict[key].groupby(record_data_df_dict[key]['Date']).agg({key: 'sum', 'Day': lambda x: x.mode().iat[0]}).reset_index()

I am also captivated to observe some data in monthly time frame. Thus, I keep them in monthly dictionary, record_data_df_dict_monthly.

record_data_df_dict_monthly = {}
for key in key_get_sum:
    record_data_df_dict_monthly[key] = record_data_df_dict[key].groupby(record_data_df_dict[key]['Date'].str[:-3]).agg({key: 'sum', 'Month': lambda x: x.mode().iat[0]}).reset_index()

I have been wearing Apple Watch since approximately 2020. But I only started my workout routine since September 2022. So, I would like to understand the transformation before and after exercise.

# Before vs After Workout
# Workout routine starts from 1 September 2022 -> data filter after workout are specified starts on this date

# Body mass progress before and after Workout
record_data_df_BodyMass_start_Sep22 = record_data_df_dict["BodyMass"].loc[(record_data_df_dict["BodyMass"]['Date'] >= '2022-09-01')]

#Active Energy Burned before and after workout routine
record_data_df_ActiveEnergyBurned_before_workout = record_data_df_dict_monthly["ActiveEnergyBurned"].loc[(record_data_df_dict_monthly["ActiveEnergyBurned"]['Date'] < '2022-08-31')]
record_data_df_ActiveEnergyBurned_after_workout = record_data_df_dict_monthly["ActiveEnergyBurned"].loc[(record_data_df_dict_monthly["ActiveEnergyBurned"]['Date'] >= '2022-08-31')]

# Basal Energy Burned before and after workout routine
record_data_df_BasalEnergyBurned_before_workout = record_data_df_dict_monthly["BasalEnergyBurned"].loc[(record_data_df_dict_monthly["BasalEnergyBurned"]['Date'] < '2022-08-31')]
record_data_df_BasalEnergyBurned_after_workout = record_data_df_dict_monthly["BasalEnergyBurned"].loc[(record_data_df_dict_monthly["BasalEnergyBurned"]['Date'] >= '2022-08-31')]

# Distance Walking-Running before and after workout routine
record_data_df_Distance_before_workout = record_data_df_dict_monthly["DistanceWalkingRunning"].loc[(record_data_df_dict_monthly["DistanceWalkingRunning"]['Date'] < '2022-08-31')]
record_data_df_Distance_after_workout = record_data_df_dict_monthly["DistanceWalkingRunning"].loc[(record_data_df_dict_monthly["DistanceWalkingRunning"]['Date'] >= '2022-08-31')]

# Step count before and after workout routine
record_data_df_StepCount_before_workout = record_data_df_dict_monthly["StepCount"].loc[(record_data_df_dict_monthly["StepCount"]['Date'] < '2022-08-31')]
record_data_df_StepCount_after_workout = record_data_df_dict_monthly["StepCount"].loc[(record_data_df_dict_monthly["StepCount"]['Date'] >= '2022-08-31')]

Data Collection in <Workout>

Now we will attain our workout data and save it in a dictionary called final_workout_dict and turned it into a DataFrame, final_workout_df.

final_workout_dict = []
workout_list = list(root.iter('Workout'))

for i in range(len(workout_list)):
    workout_dict = workout_list[i].attrib
    WorkoutStatisticsList = list(workout_list[i].iter("WorkoutStatistics"))

    for i, WorkoutStatistics in enumerate(WorkoutStatisticsList):
        if "ActiveEnergyBurned" in WorkoutStatistics.attrib['type']:
            workout_dict['activeEnergyBurned'] = WorkoutStatistics.attrib['sum']
        if "BasalEnergyBurned" in WorkoutStatistics.attrib['type']:
            workout_dict['basalEnergyBurned'] = WorkoutStatistics.attrib['sum']

    final_workout_dict.append(workout_dict)

final_workout_df = pd.DataFrame(final_workout_dict) #create final_workout_df dataframe
final_workout_df.head

Sneak peak of the final_workout_df:

The first five output of final_workout_df.

Data Cleaning in <Workout>

Just like in record data, final_workout_df is also needed to be managed and organized by running:

#drop 'creationDate' and 'endDate' column
final_workout_df_cleaned = final_workout_df.drop(['sourceName','sourceVersion', 'device', 'creationDate','endDate'], axis=1)

# transform creationDate into date format 
final_workout_df_cleaned['Date'] = pd.to_datetime(final_workout_df['startDate']).dt.strftime('%Y-%m-%d')
final_workout_df_cleaned['Day'] = pd.to_datetime(final_workout_df['startDate']).dt.strftime('%A')

# rename Activity Type
final_workout_df_cleaned['workoutActivityType'] = final_workout_df['workoutActivityType'].str.replace('HKWorkoutActivityType','')

# reorder column
final_workout_df_cleaned = final_workout_df_cleaned[['Day', 'Date', 'workoutActivityType', 'duration', 'durationUnit', 'activeEnergyBurned', 'basalEnergyBurned']]

# transform data type of 'duration' from object into float
final_workout_df_cleaned['duration'] = final_workout_df['duration'].astype(float)

# transform data type of 'activeEnergyBurned' and 'basalEnergyBurned' from object into float
final_workout_df_cleaned['activeEnergyBurned'] = final_workout_df['activeEnergyBurned'].astype(float)
final_workout_df_cleaned['basalEnergyBurned'] = final_workout_df['basalEnergyBurned'].astype(float)

display(final_workout_df_cleaned)

Output:

Then, I keep my workout data when I regularly workout since September 2022 in separate DataFrame: workout_routine_df.

# filter workout data starts from when the regular workout was started (September 2022)
workout_routine_df = final_workout_df_cleaned.loc[(final_workout_df_cleaned['Date'] >= '2022-09-01')]
display(workout_routine_df)

Output:

Summary

To sum up, we have some DataFrames:

record_data → all recorded data as an element that has <Record> tag in xml file.
record_data_cleaned → record_data that has information of type, date, day, value, and unit.
record_data_df_dict → record_data_cleaned that has been:
- sorted by date
- the value column has been renamed to its data name
- filtered only to the interested data.
record_data_df_dict_daily → value of record_data_df_dict that has been accumulated daily.
record_data_df_dict_monthly → aggregated record_data_df_dict every month
final_workout_df → workout DataFrame
final_workout_df_cleaned → cleaned workout DataFrame
workout_routine_df → filtered final_workout_df starts from the beginning of workout routine.

On the next post, those DataFrames are our keys to visualize and get insights on our wellness retrospectively so we could understand on how should we maintain or improve our well-being.

Posted by:

Sekar Writes

I’m a nonfiction reader with a medical science background who loves turning tricky reads into simple, easy-to-understand insights. I break down big ideas into friendly bits, making every book feel fun and easy.

Sekar Writes

How to Extract and Clean Apple Health Data with Python?

Retrieving Apple Health Data

Requirements

Import Libraries

Import and Load the XML file

Data Collection in <Record>

Data Cleaning in <Record>

Data Collection in <Workout>

Data Cleaning in <Workout>

Summary

Comments

Leave a Reply Cancel reply

You Might Also Like

Review and Summary: Knead to Know

Review and Summary: Unreliable

Review and Summary: Being Mortal