Querying PubMed via the easyPubMed package in R

PubMed (NCBI Entrez) is an online database of citations for biomedical literature, available at the following URL: http://www.ncbi.nlm.nih.gov/pubmed. Data can also be retrieved from PubMed in an automated way via the NCBI Entrez E-utilities. A description of how the NCBI E-utilities work is available at the following URL: http://www.ncbi.nlm.nih.gov/books/NBK25501/.
easyPubMed is an R package I wrote that makes it easy to download content from PubMed in XML format. easyPubMed includes 3 functions: get_pubmed_ids(), fetch_pubmed_data() and batch_pubmed_download().


Note! This post discusses the use of easyPubMed version 2.23 or lower. The easyPubMed library has significantly evolved since this content was first made available. While the code below may NOT be compatible with the latest version of easyPubMed, this post is maintained in its original format for legacy purposes. For more info about the latest version of easyPubMed, please see: https://www.data-pulse.com/dev_site/easypubmed/.


Get easyPubMed

You can download the latest version of easyPubMed (dev version) from GitHub. A convenient way to install a package from GitHub is to use remotes. You can also obtain a specific branch/version of interest (here, version 2.23).

remotes::install_github(repo = "dami82/easyPubMed", force = TRUE, build_vignettes = TRUE, ref = "dev_2_23")

get_pubmed_ids() takes a character string as argument. This character string is just the text of the PubMed query we want to perform. You can use the same syntax you would use for a regular PubMed query. This function queries PubMed via the NCBI eSearch utility. The result of the query is saved on the PubMed History Server. This function returns a list that includes info about the query, the PubMed IDs of the first 20 results, a WebEnv string and a QueryKey string. These two strings are required by the fetch_pubmed_data() function, as they are used for accessing the results stored on the PubMed History Server. The returned list also contains a Count value (note that Count is a character) that informs about the total number of results the query produced.
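
Because Count comes back as a character string, it is worth converting it before doing any arithmetic with it. The short sketch below uses a hand-made stand-in list (mock_id_list, with made-up values) that only mimics the fields described above; it is not real output from get_pubmed_ids().

```r
# Hypothetical stand-in mimicking the fields described above (made-up values)
mock_id_list <- list(
  Count    = "137",
  QueryKey = "1",
  WebEnv   = "MCID_0000000000000000",
  IdList   = list(Id = "30000001", Id = "30000002")
)

# Count is a character string: convert it before using it as a number
n_results <- as.integer(mock_id_list$Count)

# Only the first IDs are included in the list itself
n_ids_in_list <- length(mock_id_list$IdList)

# Are there more results on the History Server than IDs in the list?
needs_fetch <- n_results > n_ids_in_list
```

With a real query, the same conversion would apply to myIdList$Count.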

Example. In order to retrieve citations about “p53” published from laboratories located in Chicago, IL in 2019, we can run the following lines of code.


library(easyPubMed)

myQuery <- "p53 AND Chicago[Affiliation] AND 2019[PDAT]"
myIdList <- get_pubmed_ids(myQuery)

# the query produced the following number of results
print(myIdList$Count)

# this is the unique WebEnv string
print(myIdList$WebEnv)

# the PubMed ID of the first record produced by this query is the following
print(myIdList$IdList[[1]])

# open the PubMed abstract corresponding to the first record in a new window of the browser
httr::BROWSE(paste("http://www.ncbi.nlm.nih.gov/pubmed/", myIdList$IdList[[1]], sep = ""))

fetch_pubmed_data() and table_articles_byAuth()
fetch_pubmed_data() retrieves data from the PubMed history server in XML format via the eFetch utility. The only required argument of the fetch_pubmed_data() function is a list containing a QueryKey value and a WebEnv value. Typically, this is the list resulting from a get_pubmed_ids() call. fetch_pubmed_data() will return an XMLInternalDocument-class object including the first 500 records returned by the PubMed query. In order to retrieve a different number of records, we need to specify the following optional arguments:

  • retstart is an integer that defines the position of the first item to be retrieved by the fetch_pubmed_data() function.
  • retmax is an integer that defines the total number of items to be fetched.

Even when a large number of records has to be fetched, it is recommended to download records in batches of 500 to 1000 items at a time. The maximum number of records that can be fetched at one time is 5000.

Note: easyPubMed DOES NOT return results as XMLInternalDocument-class objects anymore. Results are always returned as one or more strings. We can fetch records in XML (text including XML tags), plain TXT, and other PubMed-supported formats. To extract specific XML fields of interest, we can rely on the custom_grep() function. To extract multiple fields at once, we can use table_articles_byAuth(). In the latter case, fields are returned as a data.frame in which each row corresponds to one author. We can extract the first author only, the last author only, or all authors of a record. All other fields (DOI, Title, Journal, …) are recycled for each author.
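
The batching advice above can be sketched with a small helper that turns a total record count into retstart/retmax pairs. This is only an illustration: make_batches() is a hypothetical helper (not part of easyPubMed), and it assumes that retstart is 0-based, as described in the NCBI E-utilities documentation.

```r
# Compute retstart/retmax pairs covering `total` records in batches.
# Assumption: retstart is 0-based (first record = 0).
make_batches <- function(total, batch_size = 500L) {
  total <- as.integer(total)            # Count is returned as a character string
  batch_size <- as.integer(batch_size)
  stopifnot(!is.na(total), total >= 0L, batch_size >= 1L, batch_size <= 5000L)
  if (total == 0L) {
    return(data.frame(retstart = integer(0), retmax = integer(0)))
  }
  starts <- seq.int(0L, total - 1L, by = batch_size)
  # The last batch may be shorter than batch_size
  sizes <- pmin(batch_size, total - starts)
  data.frame(retstart = starts, retmax = sizes)
}

batches <- make_batches("1234", batch_size = 500L)
```

Each row of batches could then be passed to fetch_pubmed_data() as its retstart and retmax arguments.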

# fetch PubMed records
topRecords <- fetch_pubmed_data(myIdList)

# class is 'character'; length is '1'
class(topRecords)
length(topRecords)

# fetch the first 20 PubMed records (retstart is 0-based in the E-utilities)
top20records <- fetch_pubmed_data(myIdList, retstart = 0, retmax = 20)

# Extract titles
myTitles <- custom_grep(xml_data = top20records, tag = "ArticleTitle", format = "char")

# Extract multiple fields from each PubMed record
allFields <- table_articles_byAuth(top20records, included_authors = "last", getKeywords = TRUE)

In the following real-world example, we are going to fetch all papers about “p53” published by laboratories located in Chicago (2010-2019). For each record, the PubMed ID, the name of the first author and the publication title will be extracted, together with DOI and keywords. Results are saved as a csv file. All this can be accomplished using a few lines of code.


myQuery <- 'p53 AND Chicago[Affiliation] AND ("2010/01/01"[PDAT] : "2019/12/31"[PDAT])' 
myIdList <- get_pubmed_ids(myQuery)

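# Count is returned as a character string; retstart is 0-based
n_records <- as.integer(myIdList$Count)
all_steps <- seq(0, n_records - 1, by = 50)
results <- lapply(all_steps, function(i) {
  y <- fetch_pubmed_data(pubmed_id_list = myIdList, retmax = 50, retstart = i)
  yy <- table_articles_byAuth(y, included_authors = "first", getKeywords = TRUE)
  yy[, c("pmid", "doi", "title", "jabbrv", "lastname", "keywords")]
})

results <- do.call(rbind, results)

# Save as csv (the file name is an arbitrary choice)
write.csv(results, file = "p53_chicago_2010_2019.csv", row.names = FALSE)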

Download and extract fields
We can save PubMed data locally before extracting fields from each record. An example follows.

myQuery <- 'p53 AND Chicago[Affiliation] AND ("2010/01/01"[PDAT] : "2019/12/31"[PDAT])' 
fdt_files <- batch_pubmed_download(pubmed_query_string = myQuery,
                                   format = "xml",
                                   batch_size = 50,
                                   dest_file_prefix = "fdt",
                                   encoding = "UTF-8")

# File names
print(fdt_files)

# Read files, extract fields, and then cast as data.frames
fdt_list <- lapply(fdt_files, table_articles_byAuth, 
                   included_authors = "last", getKeywords = TRUE) 
sapply(fdt_list, class)

# Aggregate
results <- do.call(rbind, fdt_list)

These are some simple examples to help you get started with easyPubMed. Don’t hesitate to post comments or email me at damiano DOT fantini AT gmail DOT com with questions, concerns, and suggestions. Thanks.

    1. Damiano (Post author)

      Hi Tyler, you can query for multiple authors at the same time. However, you cannot include a thousand authors in the same query. The query is passed via a GET method, which means that it will be passed to the PubMed server as part of the URL. In my hands, you can query about 100 names at the same time. Below, you can find an example.

      my_query <- "Immune AND Chicago AND 2017[PDAT]"
      #
      # Query pubmed and fetch the results
      my_query <- get_pubmed_ids(my_query)
      my_abstracts_xml <- fetch_pubmed_data(my_query, retmax = 1000)
      my_abstracts_list <- articles_to_list(my_abstracts_xml)
      #
      # Process each PubMed record to extract names and last names
      # Use do.call() to combine everything in a data.frame
      # Note that this will take a few minutes (3-6m).
      my_auth_list <- lapply(my_abstracts_list, article_to_df, autofill = F, max_chars = 0)
      my_auth_list <- do.call(rbind, my_auth_list)
      #
      # An excerpt of what you got
      my_auth_list[1:10, c("pmid", "lastname", "firstname")]
      #
      # Now, let's recursively query all these authors at once. We concatenate
      # (Last_name First_Name[AU] AND 2017[PDAT])
      # AU is a filter for author, PDAT for published date
      #
      allAuths <- list()
      for (i in 1:nrow(my_auth_list)) {
        getNM <- gsub("^[[:space:]]+|[[:space:]]+$", "", my_auth_list[i, c("lastname", "firstname")])
        if (sum(nchar(getNM)<1) == 0){
          fullNM <- paste(paste(getNM, collapse = " "), "[AU]", sep = "", collapse = "")
          allAuths[[fullNM]] <- 1
        }
      }
      #
      # We are attempting 5382 authors at once. Well, it won't work...
      length(allAuths)
      #
      # Finalize mega query. How many authors can I query at once?
      #
      # 10: yes!
      megaQuery <- paste(names(allAuths)[1:10], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job01 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0010")
      #
      # 50: yes!
      megaQuery <- paste(names(allAuths)[1:50], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job02 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0050")
      #
      # 150: maybe...
      megaQuery <- paste(names(allAuths)[1:150], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job03 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0150")

      # Don't attempt more authors at the same time, it won't work...
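
Since the limit comes from the length of the URL, one possible way to handle a long list of authors is to pack the terms into several OR-queries, each kept under a character budget. The sketch below is only an illustration: pack_terms() is a hypothetical helper (not part of easyPubMed), the author terms are made up, and the 2000-character budget is an assumed value, not an official NCBI figure.

```r
# Pack search terms into OR-queries whose length stays under a character budget.
# max_chars is an assumed, illustrative limit.
# A single term longer than the budget still gets a query of its own.
pack_terms <- function(terms, suffix = "", max_chars = 2000L) {
  queries <- character(0)
  current <- character(0)
  build <- function(x) paste0("(", paste(x, collapse = " OR "), ")", suffix)
  for (term in terms) {
    candidate <- c(current, term)
    if (length(current) > 0L && nchar(build(candidate)) > max_chars) {
      queries <- c(queries, build(current))  # close the current query
      current <- term                        # start a new one
    } else {
      current <- candidate
    }
  }
  if (length(current) > 0L) queries <- c(queries, build(current))
  queries
}

# Made-up author terms, for illustration only
authors <- paste0("Lastname", 1:300, " Firstname[AU]")
queries <- pack_terms(authors, suffix = " AND 2017[PDAT]", max_chars = 2000L)
```

Each element of queries could then be submitted separately, for example with batch_pubmed_download() and a different dest_file_prefix per query.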

      Hope this helps. Thank you!

    1. Damiano (Post author)

      Hi Athul,
      thanks for letting me know. I forgot to update this page after I updated the easyPubMed package some time back. The error was raised by the fetch_pubmed_data() function. The first argument should now read pubmed_id_list and not pubmedIdList. I updated the wrong line in the script above, and the correct code is now:

      recordsXml <- fetch_pubmed_data(pubmed_id_list = myIdList, retstart = i, retmax = myRetmax) [...]

      The code above should now work out-of-the-box (I double checked, it runs smoothly on my system). Again, thanks for pointing this error out!

    1. Damiano (Post author)

      This is now part of the official easyPubMed release. Let me know if you need help with this.

    1. Damiano (Post author)

      # Load package
      library(easyPubMed, quietly = TRUE)
      # Let's say you have the following PubMed ID, and you want to retrieve the corresponding abstract
      my_query_string <- "30446446[PMID]"
      #
      # First, query Entrez
      my_query <- get_pubmed_ids(my_query_string)
      #
      # Then, fetch the records
      xml_record <- fetch_pubmed_data(my_query)
      #
      # Finally, extract info (including the abstract)
      final_df <- easyPubMed::table_articles_byAuth(pubmed_data = xml_record,   # your input
                                                    max_chars = -1,             # fetch the whole abstract
                                                    included_authors = 'last')  # one author will suffice
      #
      # Now, you have your abstract in the 'abstract' slot
      print(final_df$abstract)

    1. Damiano (Post author)

      Hi! Sorry for the late reply… Since the last update, I implemented a custom approach to extract XML-tagged fields, so easyPubMed does not rely on the XML2 library anymore. As a result, I need to fix the code in the example, since (as you pointed out) it is outdated and now throws an error. I will fix this in the next few days. In the meantime, send me a message if you need help getting started with easyPubMed. Again, thanks for the post.

    Hi Damiano,

    Is it possible to get sample size reports? If not directly, maybe by specifying an inclusion of sentences in the abstracts that have “sample size” “study size” and “n=“ ?


    1. Damiano (Post author)

      Sure, there are a couple of ways to do that, we can test and see which one is the most efficient.
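
One possible way to do this, sketched below, is to split each abstract into sentences and keep those matching the expressions mentioned in the question. The abstracts here are made up for illustration; with real data they could come from the abstract column returned by table_articles_byAuth(). The sentence splitting is deliberately naive, so results would need manual checking.

```r
# Made-up abstracts, for illustration only
abstracts <- c(
  "We enrolled patients from two centers. The sample size was 120. Outcomes improved.",
  "Mice were treated for 4 weeks (n = 8 per group). Tumor volume decreased.",
  "This review summarizes recent findings on p53 signaling."
)

# Split each abstract into sentences at ., ! or ? followed by whitespace
sentences <- strsplit(abstracts, "(?<=[.!?])\\s+", perl = TRUE)

# Keep sentences mentioning "sample size", "study size" or "n=" / "n ="
pattern <- "sample size|study size|\\bn\\s*="
hits <- lapply(sentences, function(s) {
  s[grepl(pattern, s, ignore.case = TRUE, perl = TRUE)]
})
```

Here hits is a list with one element per abstract, holding the matching sentences (possibly none).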

    I have a list of PubMed IDs and DOIs, and I want to extract author affiliations and email addresses in batches. How can I do that?

    1. Damiano (Post author)

      The upcoming version of easyPubMed will include a function designed to fetch records given a list of PMIDs… I’ll keep you posted.

    Hi Damiano
    could you please help? I have a list of PMIDs. How can I extract DOI and affiliation with easyPubMed?

    1. Damiano (Post author)

      Sure. There is a function for this in the newest version of the package, available on GitHub.

      # Install via devtools
      devtools::install_github("dami82/easyPubMed", force = TRUE, build_opts = NULL)

      Here, I assume that `my_pmids` is a character vector of PubMed identifiers, such as `"30197875" "30190923" "30190919" "29793310" "29633980" "29599908"`. You should use `fetch_PMID_data` to retrieve the records, and then you can extract fields using `table_articles_byAuth` (as usual). The output includes all the fields you are interested in.

      # get ID data
      all.DATA <- fetch_PMID_data(pmids = my_pmids)

      # extract data, make data.frame
      all.df <- table_articles_byAuth(pubmed_data = all.DATA, included_authors = "last", max_chars = 100)
      all.df[1:5,c(1, 5, 6, 7, 8)]

      Send me an email if you have other questions (or provide me with a valid email address in the form, so I can reach out to you).

    1. Damiano (Post author)

      In principle, this may be doable. The problem is that I do not see how/where MeSH terms are stored in a PubMed record. Specifically, I do not see MeSH fields in sample PubMed records. My library makes it easy to extract XML fields from PubMed records. If MeSH terms are not identified by a specific XML tag, it may be difficult to extract them. If you can provide me with some guidance, I suggest taking this offline… Send me an email if you are interested.

    Love your package.
    I am trying to create a dataframe but ultimately I am getting the following error:

    An error occurred
    Error in tryCatch({ : object ‘final.mat’ not found

    here is my code

    fdt_xml <- batch_pubmed_download(pubmed_query_string = fdt_query,
    format = "xml",
    batch_size = 50,
    dest_file_prefix = "fdt",
    encoding = "ASCII")

    fdt_list <- lapply(fdt_xml,
    articles_to_list) # generate list of lists

    fdt_list <- flatten(fdt_list) # convert list of lists to a single list (Do I need to do this step??)

    fdt_df <- lapply(fdt_list,
    article_to_df,
    autofill = FALSE,
    max_chars = 1, # don't keep any of the abstract text
    getKeywords = TRUE) # just keep the keywords

    – I get the error….

    Can you help explain why?

    If I select just one item in the list, it is not a problem to run the following code..

    fdt_df <- lapply(fdt_list[1],
    article_to_df,
    autofill = FALSE,
    max_chars = 1, # don't keep any of the abstract text
    getKeywords = TRUE) # just keep the keywords

    Any help/advice much appreciated…


    1. Damiano (Post author)

      Hi Eric. Thanks for using easyPubMed.
      The error. Your code generates a list of character vectors (not a list of lists). In the following step, you are trying to apply article_to_df() over each element of the list. This means that article_to_df() receives a character vector of length=50 as argument. Unfortunately (as you pointed out above), article_to_df() can only handle one string (character vector of length 1).
      Solution. The example above has been updated to show how to handle data downloaded using batch_pubmed_download(). Briefly, you should just use table_articles_byAuth().
      I hope this may be helpful.
      Best regards.

    Thank you very much for this brilliant package. I have one small question. When I use the fetch_pubmed_data function it doesn’t take into account the retstart parameter. Indeed, the xml data obtained always start from 0 and stops at retmax (this parameter works). I was setting up a loop to fetch repetitively from the same entrez_id list object. Any idea why this is? Is there a way to extract only parts of the entrez_id obtained with the function get_pubmed_ids? I don’t seem to get my head around the structure of the entrez_id. The count is correct, but I don’t find a place where all the PMIDs are stored.

    1. Damiano (Post author)

      Hi Martin,
      thanks for using easyPubMed. If you want to download a large number of records, you may want to use the batch_pubmed_download() function. Alternatively, you can play around with the retmax/retstart arguments within a loop. I just put together a vignette about how to use retmax and retstart. It is available at this URL: https://www.data-pulse.com/projects/Rlibs/vignettes/easyPubMed_03_retmax_example.html. Let me know if this addresses your questions.

    Dear Damiano, congratulations on this amazing package, it has been extremely useful.

    Is there an easy way to download the XML files given a long list of PMIDs (n=120K)?

    I’ve seen that batch_pubmed_download() can download a long list of PMIDs in batches, writing multiple XML files. However, this function starts with a query term, and I did not see an option to input a list of PMIDs instead of a search string. I tried with “123[PMID] OR 1234[PMID] OR 12345[PMID]..” but PubMed returns an error if this includes more than 5K PMIDs.

    Thanks in advance

    1. Damiano (Post author)

      You are raising a valid point here. I think I can easily implement this. I’ll work on this and keep you posted.
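
One possible workaround with the functions shown in this post is to split the list of PMIDs into smaller chunks and build one query string per chunk. The sketch below is only an illustration: the PMIDs are made up, and the chunk size of 100 is an assumed, conservative choice rather than a documented limit.

```r
# Made-up PMIDs standing in for a long list
pmids <- as.character(10000001:10012000)

# Split into chunks of at most chunk_size IDs (chunk_size is an arbitrary choice)
chunk_size <- 100L
chunks <- split(pmids, ceiling(seq_along(pmids) / chunk_size))

# Build one "[PMID] OR ..." query string per chunk
pmid_queries <- vapply(
  chunks,
  function(ids) paste(paste0(ids, "[PMID]"), collapse = " OR "),
  character(1)
)
```

Each element of pmid_queries could then be passed as pubmed_query_string to batch_pubmed_download(), using a different dest_file_prefix for each chunk.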

    1. Damiano (Post author)

      Hello Liam,
      Thanks for using easyPubMed. Currently, the package does not support extraction of citation counts. Unfortunately, citation info is not included in legacy PubMed records. This may change in the upcoming months if/when a new API is released by NCBI.
      Best regards.

    1. Damiano (Post author)

      Hi Andy,
      easyPubMed retrieves PubMed records only. Therefore, only genes mentioned in the title or abstract can be extracted.
      An issue with this analysis would be the lack of standardization with gene names. Some articles report official symbols, others may use full name or aliases.
      The analysis is doable, but you cannot expect to obtain exact results. This being said, it could be interesting to give this a try and run searches at regular intervals.
      Best regards,

