Florian Privé

R(cpp) enthusiast

Scraping some French medical school rankings

Written on September 10, 2017

In this post, I will analyze the results of the “épreuves classantes nationales (ECN)”, a competitive examination taken at the end of the 6th year of medical school in France. The best-ranked students get to choose first where they want to continue their medical training.

A very clean dataset

The data is available as a PDF. I’m not an expert in scraping and parsing data, but this turned out to be very simple because the dataset is well formatted.

If you see that I’m doing something too complicated, or that I could do it more cleanly or faster, please comment on this post and help me learn some scraping/parsing jedi techniques.

Scraping

I’ll use package pdftools to get the text from this PDF.

head(txt <- pdftools::pdf_text("https://goo.gl/wUXvjk"), n = 1)
## [1] "                                                                              Paris, le 28 juin 2017\n     Liste des étudiants et des internes de médecine, classés par ordre de\n     mérite, ayant satisfait aux épreuves classantes nationales anonymes\n    donnant accès au troisième cycle des études médicales, organisées au\n                            titre de l’année universitaire 2017-2018.\n        Nota : Il est demandé aux étudiants de vérifier les informations d’état civil les\n  concernant et d’envoyer à la gestionnaire des ECN la copie d’une pièce d’identité pour\n  prise en compte des modifications. Ces corrections sont importantes avant parution de\n                      cette liste au Journal officiel de la République française.\n0001 Mme Beaumont (Anne-Lise), née le 1 septembre 1993.\n0002 M. Petitdemange (Arthur, Paul, Joseph), né le 15 septembre 1993.\n0003 M. Bacon (Seraphin, Charly, Philippe), né le 29 janvier 1993.\n0004 M. Ditac (Geoffroy, Arnaud, André), né le 23 juin 1994.\n0005 M. Faure (Guillaume, Thomas), né le 7 août 1992.\n0006 Mme Weil (Amandine), née le 4 mars 1992.\n0007 M. Ezzouhairi (Nacim), né le 1 juin 1993.\n0008 M. Coulon (Antoine), né le 11 juin 1992.\n0009 Mme Le Gaudu (Violette, Luce, Catherine), née le 3 juin 1994.\n0010 M. Boyer (Jeremy), né le 14 mai 1993.\n0011 M. Villemaire (Axel, Michaël, Pierre), né le 24 juillet 1991.\n0012 M. Azoulay (Levi-Dan), né le 13 décembre 1993.\n0013 M. Assouline (Allan), né le 14 janvier 1994.\n0014 M. Rouchaud (Aymeric), né le 16 mai 1993.\n0015 M. Padden (Michael, James), né le 6 février 1993.\n0016 M. Gavoille (Antoine, Paul, Jean), né le 12 juillet 1994.\n0017 M. Marie (Benjamin, Pierre, Alexandre), né le 14 mars 1994.\n0018 Mlle Boulle (Charlotte, Marie, Cécile), née le 3 septembre 1988.\n0019 Mme Chatelain (Juliette), née le 23 juillet 1993.\n0020 Mme Laporte (Amandine, Capucine), née le 9 octobre 1993.\n0021 M. D'izarny Gargas (Thibaut, François, Arnaud), né le 2 mai 1993.\n0022 Mme Boccon Gibod (Clémentine, Raphaëlle), née le 24 avril 1993.\n0023 Mme Chan (Camille, Marie), née le 11 juillet 1990.\n0024 M. Lemasle (Aymeric), né le 7 avril 1994.\n0025 M. Sulman (David), né le 9 novembre 1992.\n0026 Mlle Fresnel (Clémentine), née le 5 mars 1992.\n0027 M. Dumortier (Pierre, Antoine, Frédéric), né le 19 juillet 1993.\n0028 Mme Torres-Villaros (Héloïse, Laure, Barbara), née le 18 septembre 1992.\n0029 M. Memmi (Benjamin), né le 3 décembre 1993.\n0030 Mme Kherabi (Yousra), née le 23 décembre 1993.\n                                                              1\n"
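
`pdf_text()` returns one character string per page, with the lines of the page separated by `\n`. Here is a minimal sketch, assuming each record sits on its own line as in the output above, of splitting one page into lines and keeping only the ranked records. The `page` string is a shortened copy of that output, and the names `page_lines` and `records` are only illustrative.

```r
# Shortened copy of one page of text, as printed above
page <- paste0(
  "   Paris, le 28 juin 2017\n",
  "0001 Mme Beaumont (Anne-Lise), née le 1 septembre 1993.\n",
  "0002 M. Petitdemange (Arthur, Paul, Joseph), né le 15 septembre 1993.\n",
  "                     1\n"
)

# Split the page on newlines and strip the surrounding whitespace
page_lines <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])

# Keep only the lines starting with a 4-digit rank followed by a space;
# this drops the header and the page number
records <- page_lines[grepl("^[0-9]{4} ", page_lines)]
print(records)
```

This is not what the rest of the post does; it is just another way to look at the same text before reaching for a regular expression.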

Parsing

I’ll use the little I know about regular expressions to parse this data.

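# Rank, title (M., Mme or Mlle), name, then "né le"/"née le" up to the final dot.
# Alternatives go in a (non-capturing) group; square brackets would only
# match a single character.
pat <- "([0-9]{4} (?:M\\.|Mme|Mlle) .*?, née? le .*?)\\."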
data <- unlist(gsubfn::strapply(txt, pattern = pat))

head(data)
## [1] "0001 Mme Beaumont (Anne-Lise), née le 1 septembre 1993"              
## [2] "0002 M. Petitdemange (Arthur, Paul, Joseph), né le 15 septembre 1993"
## [3] "0003 M. Bacon (Seraphin, Charly, Philippe), né le 29 janvier 1993"   
## [4] "0004 M. Ditac (Geoffroy, Arnaud, André), né le 23 juin 1994"         
## [5] "0005 M. Faure (Guillaume, Thomas), né le 7 août 1992"                
## [6] "0006 Mme Weil (Amandine), née le 4 mars 1992"
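
A side note on regular expression syntax: square brackets match a single character taken from a set, whereas a parenthesised group with `|` matches one of several whole alternatives. A minimal base R check of the difference, using the three titles that appear in this list plus a stray letter (the vector `titles` is only an illustration):

```r
titles <- c("M.", "Mme", "Mlle", "e")

# Bracket expression: exactly ONE character among M, ., |, m, e, l
single_char <- grepl("^[M.|mel]$", titles)

# Group with alternation: one of the three complete titles
whole_title <- grepl("^(M\\.|Mme|Mlle)$", titles)

print(data.frame(titles, single_char, whole_title))
```

Only the second form accepts the real titles and rejects the stray `"e"`.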
library(stringr)

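# One row per person, 7 columns: rank, title, family name, first name, day, month, year
data_parsed <- matrix(NA_character_, length(data), 7)
# Split each record into words
data_words <- str_extract_all(data, boundary("word"))
# The first four words are the rank, title, family name and first given name
data_parsed[, 1:4] <- t(sapply(data_words, head, n = 4))
# The last three words are the day, month and year of birth
data_parsed[, 5:7] <- t(sapply(data_words, tail, n = 3))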
head(data_parsed)
##      [,1]   [,2]  [,3]           [,4]        [,5] [,6]        [,7]  
## [1,] "0001" "Mme" "Beaumont"     "Anne"      "1"  "septembre" "1993"
## [2,] "0002" "M"   "Petitdemange" "Arthur"    "15" "septembre" "1993"
## [3,] "0003" "M"   "Bacon"        "Seraphin"  "29" "janvier"   "1993"
## [4,] "0004" "M"   "Ditac"        "Geoffroy"  "23" "juin"      "1994"
## [5,] "0005" "M"   "Faure"        "Guillaume" "7"  "août"      "1992"
## [6,] "0006" "Mme" "Weil"         "Amandine"  "4"  "mars"      "1992"
suppressMessages(library(tidyverse))

data_parsed2 <- as_tibble(data_parsed) %>%
  transmute(
    ranking = as.integer(V1),
    is_male = (V2 == "M"),
    family_name = V3,
    first_name = V4,
    # paste() is vectorized, so no row-wise pmap() is needed
    birth_date = lubridate::dmy(paste(V5, V6, V7))
  )

data_parsed2
## # A tibble: 8,370 x 5
##    ranking is_male  family_name first_name birth_date
##      <int>   <lgl>        <chr>      <chr>     <date>
##  1       1   FALSE     Beaumont       Anne 1993-09-01
##  2       2    TRUE Petitdemange     Arthur 1993-09-15
##  3       3    TRUE        Bacon   Seraphin 1993-01-29
##  4       4    TRUE        Ditac   Geoffroy 1994-06-23
##  5       5    TRUE        Faure  Guillaume 1992-08-07
##  6       6   FALSE         Weil   Amandine 1992-03-04
##  7       7    TRUE   Ezzouhairi      Nacim 1993-06-01
##  8       8    TRUE       Coulon    Antoine 1992-06-11
##  9       9   FALSE           Le      Gaudu 1994-06-03
## 10      10    TRUE        Boyer     Jeremy 1993-05-14
## # ... with 8,360 more rows

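Note: this approach breaks for people whose family name consists of several words. For example, row 9 above should read “Le Gaudu” (family name) and “Violette” (first name), but it is parsed as “Le” and “Gaudu”. Hyphenated first names are also truncated: “Anne-Lise” becomes “Anne”.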
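
One possible way around this, sketched here under the assumption that every record follows the layout "rank title family-name (given names), né(e) le day month year", is to capture each field with its own group using `stringr::str_match()`. The pattern `pat_fields` and the column names are my own illustration, not part of the code above, and I only tried it on the two records copied below.

```r
library(stringr)

# Two records copied from the parsed output shown earlier
records <- c(
  "0001 Mme Beaumont (Anne-Lise), née le 1 septembre 1993",
  "0009 Mme Le Gaudu (Violette, Luce, Catherine), née le 3 juin 1994"
)

# Groups: rank, title, family name (everything before " ("),
# first given name (up to the first comma or closing parenthesis),
# then day, month and year of birth
pat_fields <- paste0(
  "^([0-9]{4}) (M\\.|Mme|Mlle) (.+?) \\(([^,)]+).*\\), ",
  "née? le ([0-9]{1,2}) (\\S+) ([0-9]{4})$"
)

# str_match() returns the full match in column 1, then one column per group
fields <- str_match(records, pat_fields)[, -1]
colnames(fields) <- c("ranking", "title", "family_name", "first_name",
                      "day", "month", "year")
print(fields)
```

With this pattern, the family name is whatever precedes the opening parenthesis, so “Le Gaudu” stays in one piece and “Anne-Lise” is kept whole.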

Analysis

Proportion male/female

mean(data_parsed2$is_male)
## [1] 0.4345281

I’m still a bit surprised that only 43% of the students in this ranking are male.
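
To get a rough sense of the uncertainty around that 43%, here is a minimal sketch using base R's `binom.test()`. The number of males is not printed in this post, so it is reconstructed from the proportion and the number of rows shown above; that reconstruction is an assumption. Treating the ranking as a random sample is also a simplification.

```r
n_total <- 8370L        # number of rows of the tibble shown above
prop_male <- 0.4345281  # output of mean(data_parsed2$is_male)

# Assumption: recover the male count by rounding proportion * total
n_male <- round(prop_male * n_total)

# Exact binomial test against an even split, with a 95% confidence interval
res <- binom.test(n_male, n_total, p = 0.5)
ci <- res$conf.int
print(ci)
```

If the interval stays well below 0.5, the imbalance is unlikely to be explained by chance alone.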

How old are they?

myggplot <- function(...) bigstatsr:::MY_THEME(ggplot(...))

myggplot(data_parsed2) +
  geom_histogram(aes(x = birth_date), bins = 100)

Students who passed without repeating any year would have been born in 1993, like me. Many students repeat the first year, because it ends with a very selective competitive examination, so many of them were born in 1992. Yet, there are quite a lot of older people and even some very young ones (we’ll see this better in another plot).

How do males compare to females when it comes to ranking?

myggplot(mutate(data_parsed2, prop_male = cummean(is_male))) + 
  geom_hline(yintercept = mean(data_parsed2$is_male), col = "red") +
  geom_line(aes(x = ranking, y = prop_male))

Even though the top-ranked student is female, males are in the majority among the best-ranked students.
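
The black line in this plot is a cumulative mean: at rank k, it gives the proportion of males among the first k students. A tiny sketch of that computation in base R, on a made-up vector of five students (the values of `is_male` below are invented for the illustration):

```r
# Hypothetical top 5: female, male, male, female, male
is_male <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Running count of males divided by the number of students seen so far
prop_male <- cumsum(is_male) / seq_along(is_male)
print(prop_male)
```

The last value of this vector is the overall proportion, which is why the line ends on the red horizontal line in the plot.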

Ranking versus Age

(myggplot(data_parsed2) +
   geom_point(aes(ranking, birth_date, color = is_male)) +
   aes(text = bigstatsr::asPlotlyText(data_parsed2))) %>%
  plotly::ggplotly(tooltip = "text")

We can see a woman who is only 19 years old (with a really nice ranking!) and a 54-year-old man (with a lower ranking).
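
Ages like these can be derived from the birth dates. Here is a minimal base R sketch, assuming age is counted in completed years on the date printed at the top of the PDF (28 June 2017). The helper `age_on()` is hypothetical, and the two birth dates are the ones of ranks 1 and 6 in the table above.

```r
# Age in completed years on a reference date (hypothetical helper)
age_on <- function(birth_date, ref_date) {
  b <- as.POSIXlt(birth_date)
  r <- as.POSIXlt(ref_date)
  # Difference in calendar years ...
  age <- r$year - b$year
  # ... minus one if the birthday has not yet occurred in the reference year
  before_birthday <- (r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday)
  age - as.integer(before_birthday)
}

birth_dates <- as.Date(c("1993-09-01", "1992-03-04"))
ages <- age_on(birth_dates, as.Date("2017-06-28"))
print(ages)
```

Comparing months and days directly avoids the small errors that dividing a number of days by 365.25 can introduce around birthdays.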

myggplot(data_parsed2, aes(ranking, birth_date)) +
  geom_point() +
  geom_smooth(aes(color = is_male), lwd = 2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Conclusion

It was interesting to analyze this dataset. It would have been even more interesting to know which school each person comes from, in order to compare the rankings of French medical schools. Maybe it would be possible to join this data with some other data to do that (I’ll let someone else do it).