

Fortunately, someone created a googlesheet, sourced from elsewhere, with every line from The Office. Here are a few of those lines:

    So you’ve come to the master for guidance? Is this what you’re saying, grasshopper?
    Actually, you called me in here, but yeah.
    All right.
    Yes, I’d like to speak to your office manager, please. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger.
    That was a woman I was talking to, so… She had a very low voice. Probably a smoker, so… So that’s the way it’s done.

This data, like the majority of data, isn’t perfect, but it’s in pretty good shape. There are some clean up steps we need to do:

- Remove text in brackets and put it in a new column called actions.
- There are 4000+ instances of ? found in the data, mainly in the last two seasons. For now I’m just going to replace all instances with ’ since that seems to be the majority of the cases.
- Change speaker to lower case since there is some inconsistent capitalization.
- Some entries for speakers have actions in them, which I’ll remove.
- Fix misspellings in the speaker field (e.g. Micheal instead of Michael).

    mutate(actions = str_extract_all(line_text, "\\"),
    mutate_at(vars(speaker), funs(tolower)) %>%
    mutate_at(vars(speaker), funs(str_replace_all(., "micheal|michel|michae$", "michael")))

    unite(season_ep, season, episode, remove = FALSE) %>%
    summarise(num_episodes = n_distinct(season_ep)) %>%

Searching around on the interwebs indicates that there were 201 episodes of The Office, however the data I have contains 186 episodes. Wikipedia counts some episodes like “A Benihana Christmas” as two, but I’m not sure why. The data closely matches the episode breakdown on IMDb with the exception of season 6. It counts Niagara parts 1 & 2 as one episode and The Delivery parts 1 & 2 as one episode instead of two. Since I am working with this data, I’m going with the idea that there were 186 episodes total.

    # proportion of episodes each character was in
    mutate(proportion = round((num_episodes / total_episodes) * 100, 1)) %>%

    unite(season_ep_scene, season, episode, scene, remove = FALSE) %>%
    summarise(num_scenes = n_distinct(season_ep_scene)) %>%

    # proportion of scenes each character was in
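To make the clean-up steps concrete, here is a minimal sketch on a tiny made-up data frame. Several things in it are my assumptions rather than anything from the original code: the toy tibble `office_lines`, the bracket pattern `"\\[.*?\\]"` (actions are assumed to sit in square brackets), and `bad_char`, which stands in for the stray character that shows up in the last two seasons. It also uses plain `mutate()` instead of `mutate_at()` with `funs()`, since `funs()` is deprecated in recent dplyr releases.

```r
library(dplyr)
library(stringr)

# Assumption: actions are written in square brackets, e.g. "[on the phone]"
bracket_pattern <- "\\[.*?\\]"
# Assumption: the stray character is the Unicode replacement character
bad_char <- "\uFFFD"

# Toy stand-in for the real data; only the column names follow the post
office_lines <- tibble(
  speaker = c("Michael", "Micheal [on the phone]", "JIM", "michel"),
  line_text = c(
    "All right. [picks up phone] Yes?",
    "So that\uFFFDs the way it\uFFFDs done.",
    "Actually, you called me in here, but yeah.",
    "Hello."
  )
)

clean_lines <- office_lines %>%
  mutate(
    # keep the bracketed actions in their own list column
    actions = str_extract_all(line_text, bracket_pattern),
    # drop the actions from the spoken text and tidy the whitespace
    line_text = str_squish(str_remove_all(line_text, bracket_pattern)),
    # swap the stray character for a right single quote
    line_text = str_replace_all(line_text, fixed(bad_char), "\u2019"),
    # remove actions from the speaker field, then lower-case it
    speaker = tolower(str_squish(str_remove_all(speaker, bracket_pattern))),
    # fix the common misspellings of michael
    speaker = str_replace_all(speaker, "micheal|michel|michae$", "michael")
  )
```

On the real data you would want to check the distinct values of `speaker` afterwards, since a pattern like `michel` would also rewrite any other name that happens to contain it.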

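The episode and scene proportions can be sketched end to end as follows. This is only an illustration under stated assumptions: `toy_lines` is a made-up data frame, and the `group_by(speaker)` and `pull()` steps are my guess at what sits between the fragments shown in the post. The column names `season`, `episode`, `scene`, and `speaker` follow the post.

```r
library(dplyr)
library(tidyr)

# Made-up data: one row per spoken line
toy_lines <- tibble(
  season  = c(1, 1, 1, 1, 2, 2),
  episode = c(1, 1, 2, 2, 1, 1),
  scene   = c(1, 2, 1, 1, 1, 2),
  speaker = c("michael", "jim", "michael", "dwight", "michael", "jim")
)

# Total number of distinct episodes in the data
total_episodes <- toy_lines %>%
  unite(season_ep, season, episode, remove = FALSE) %>%
  summarise(num_episodes = n_distinct(season_ep)) %>%
  pull(num_episodes)

# proportion of episodes each character was in
episode_proportion <- toy_lines %>%
  unite(season_ep, season, episode, remove = FALSE) %>%
  group_by(speaker) %>%
  summarise(num_episodes = n_distinct(season_ep)) %>%
  mutate(proportion = round((num_episodes / total_episodes) * 100, 1)) %>%
  arrange(desc(proportion))

# Total number of distinct scenes in the data
total_scenes <- toy_lines %>%
  unite(season_ep_scene, season, episode, scene, remove = FALSE) %>%
  summarise(num_scenes = n_distinct(season_ep_scene)) %>%
  pull(num_scenes)

# proportion of scenes each character was in
scene_proportion <- toy_lines %>%
  unite(season_ep_scene, season, episode, scene, remove = FALSE) %>%
  group_by(speaker) %>%
  summarise(num_scenes = n_distinct(season_ep_scene)) %>%
  mutate(proportion = round((num_scenes / total_scenes) * 100, 1)) %>%
  arrange(desc(proportion))
```

Building the `season_ep` key with `unite()` matters because episode numbers restart every season, so counting distinct `episode` values alone would undercount.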