Querying PubMed via the easyPubMed package in R

PubMed (NCBI Entrez) is an online database of citations for biomedical literature, available at http://www.ncbi.nlm.nih.gov/pubmed. Data can also be retrieved from PubMed programmatically via the NCBI Entrez E-utilities. A description of how the NCBI E-utilities work is available at http://www.ncbi.nlm.nih.gov/books/NBK25501/.
easyPubMed is an R package I wrote that makes it easy to download content from PubMed in XML format. easyPubMed includes three functions: get_pubmed_ids(), fetch_pubmed_data() and batch_pubmed_download().

get_pubmed_ids()
get_pubmed_ids() takes a character string as its argument: the text of the PubMed query we want to perform. You can use the same syntax you would use for a regular PubMed query. This function queries PubMed via the NCBI eSearch utility, and the result of the query is saved on the PubMed History Server. The function returns a list that includes information about the query, the PubMed IDs of the first 20 results, a WebEnv string and a QueryKey string. These two strings are required by the fetch_pubmed_data() function, which uses them to access the results stored on the PubMed History Server. The returned list also contains a Count value (note that Count is a character string) giving the total number of results the query produced.

Example. To retrieve citations about “p53” published by laboratories located in Chicago, IL, we can run the following lines of code.

library(easyPubMed)
library(httr) #for the BROWSE() function

myQuery <- "p53 AND Chicago[Affiliation]" 
myIdList <- get_pubmed_ids(myQuery)

#the query produced the following number of results
as.integer(as.character(myIdList$Count))

#this is the unique WebEnv String
myIdList$WebEnv

#the PubMed ID of the first record produced by this query is the following
myIdList$IdList[[1]]

#open the PubMed abstract corresponding to the latest record in a new window of the browser
BROWSE(paste("http://www.ncbi.nlm.nih.gov/pubmed/", myIdList$IdList[[1]], sep =""))
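
To make the eSearch step more concrete, here is a minimal sketch of the kind of request that stores a query on the History Server. The base URL and the db, term, usehistory and retmax parameters come from the NCBI E-utilities documentation. The helper build_esearch_url() is a hypothetical name used only for illustration, and the exact URL that easyPubMed assembles internally may differ.

```r
# Base URL of the NCBI eSearch utility (see the E-utilities guide linked above)
esearch_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# build_esearch_url() is a hypothetical helper, NOT an easyPubMed function
build_esearch_url <- function(query, retmax = 20) {
  paste0(esearch_base,
         "?db=pubmed",
         # percent-encode spaces, brackets and other special characters
         "&term=", utils::URLencode(query, reserved = TRUE),
         # ask NCBI to keep the results on the History Server
         "&usehistory=y",
         "&retmax=", retmax)
}

my_url <- build_esearch_url("p53 AND Chicago[Affiliation]")
print(my_url)
```

The XML returned by such a request contains the Count, QueryKey and WebEnv values described above. Nothing is sent to NCBI by this sketch; it only builds the URL string.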

fetch_pubmed_data()
fetch_pubmed_data() retrieves data from the PubMed History Server in XML format via the eFetch utility. The only required argument of the fetch_pubmed_data() function is a list containing a QueryKey value and a WebEnv value. Typically, this is the list returned by a get_pubmed_ids() call. fetch_pubmed_data() returns an XMLInternalDocument-class object including the first 500 records returned by the PubMed query. To retrieve a different number of records, we need to specify the following optional arguments:

  • retstart is an integer that defines the position of the first item to be retrieved by the fetch_pubmed_data() function.
  • retmax is an integer that defines the total number of items to be fetched.

When a large number of records has to be fetched, it is recommended to download them in batches of 500 to 1000 items at a time. The maximum number of records that can be fetched at one time is 5000.
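
As a rough illustration of batching, the sketch below computes the retstart offsets needed to cover a given number of results. It assumes that retstart is zero-based, as in the E-utilities. batch_offsets() is a hypothetical helper and not part of easyPubMed.

```r
# batch_offsets() is a hypothetical helper, NOT part of easyPubMed.
# It returns the zero-based retstart value of each batch.
batch_offsets <- function(total, batch_size = 500) {
  stopifnot(total >= 0, batch_size > 0)
  if (total == 0) return(integer(0))
  # stop at total - 1 so that no empty batch is requested
  # when total is an exact multiple of batch_size
  seq(0, total - 1, by = batch_size)
}

# Example: 1234 results fetched 500 at a time need three batches
starts <- batch_offsets(1234, 500)
print(starts)
```

Each offset could then be passed as retstart, together with retmax = 500, in a separate fetch_pubmed_data() call.
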
The resulting XMLInternalDocument-class object can be handled by the functions of the XML package. For example, we can use the XML::xpathApply() function to split the XML object into individual nodes. Each PubMed record is enclosed in <PubmedArticle> tags. Likewise, article titles are enclosed in <ArticleTitle> tags. Therefore, we can extract record titles by running the following lines of code (note that xpathApply() generates a list of XMLInternalNode-class objects unless we define fun = saveXML).

library(XML)

#fetch the first 500 PubMed records
top500records <- fetch_pubmed_data(myIdList)

#fetch the first 200 PubMed records
top200records <- fetch_pubmed_data(myIdList, retstart = 0, retmax = 200)

myTitles <- unlist(xpathApply(top200records, "//ArticleTitle", saveXML))
myTitles <- gsub("(<ArticleTitle>)|(</ArticleTitle>)", "", myTitles)
head(myTitles)
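
As an alternative sketch, the XML package can also return the text of each node directly via xmlValue, which avoids stripping the tags with gsub(). The example below runs on a small made-up XML string rather than on a real fetch_pubmed_data() result, so the records and the toy_* names are assumptions for illustration only.

```r
library(XML)

# Toy XML standing in for a fetch_pubmed_data() result (made-up records)
toy_xml <- "<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article>
    <ArticleTitle>First toy title.</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article>
    <ArticleTitle>Second toy title.</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"

toy_doc <- xmlParse(toy_xml, asText = TRUE)

# xmlValue returns the text content of each matching node, without the tags
toy_titles <- xpathSApply(toy_doc, "//ArticleTitle", xmlValue)
print(toy_titles)
```

With real data, the same call would presumably read xpathSApply(top200records, "//ArticleTitle", xmlValue).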

In the following real-world example, we are going to fetch all papers about “p53” published by laboratories located in Chicago. For each record, the PubMed ID, the name of the first author and the publication title will be extracted, added to a data frame and saved as a CSV file. You can find this example on GitHub at the following URL: https://github.com/dami82/datasci/blob/master/easyPubMed.R

library(XML)
library(easyPubMed)

myQuery <- "p53 AND Chicago[Affiliation]" 
myIdList <- get_pubmed_ids(myQuery)

myRetstart <- 0
myRetmax <- 200

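myCount <- as.integer(as.character(myIdList$Count))
# stop at myCount - 1 so that no empty batch is requested when myCount is a multiple of myRetmax
myStarts <- seq(myRetstart, max(myCount - 1, 0), by = myRetmax)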

myResult <- do.call(rbind, lapply(myStarts, (function(i){
  
  recordsXml <- fetch_pubmed_data(pubmed_id_list = myIdList, retstart = i, retmax = myRetmax)
  
  # each article is included within a <PubmedArticle> tag
  # fun = saveXML returns a list of character strings instead of InternalNode-class objects
  recordList <- xpathApply(recordsXml, "//PubmedArticle", saveXML)
  
  
  tmpDF <- t(as.data.frame(lapply(1:length(recordList), (function(x){
   
    titlePosition <- regexpr("(<ArticleTitle).+(\\/ArticleTitle>)", recordList[[x]])
    tmpTitle <- substr(recordList[[x]], titlePosition, 
                       titlePosition + attributes(titlePosition)$match.length)
    tmpTitle <- gsub("<ArticleTitle>|</ArticleTitle>|([[:space:]]$)", "", tmpTitle)
    #tmpTitle
    
    pubmedIdPosition <- regexpr("(<PMID).+(\\/PMID>)", recordList[[x]])
    tmpPMID <- substr(recordList[[x]], pubmedIdPosition, 
                      pubmedIdPosition + attributes(pubmedIdPosition)$match.length)
    tmpPMID <- gsub("<PMID|<\\/PMID>|[[:space:]]", "", tmpPMID)
    tmpPMID <- gsub("^.*>", "", tmpPMID)
    #tmpPMID
    
    tmpAuthors <- strsplit(recordList[[x]], "<AuthorList")[[1]][[2]]
    tmpFirstAuthor <- strsplit(tmpAuthors, "<Author")[[1]][[2]]
    lastNamePos <- regexpr("(<LastName).*(\\/LastName>)",tmpFirstAuthor)
    lastName <- substr(tmpFirstAuthor, lastNamePos, 
                       lastNamePos + attributes(lastNamePos)$match.length)
    lastName <- gsub("<LastName|<\\/LastName>|([[:space:]]$)", "", lastName)
    lastName <- gsub("^.*>", "", lastName)
    #lastName
    
    firstNamePos <- regexpr("(<ForeName).*(\\/ForeName>)",tmpFirstAuthor)
    firstName <- substr(tmpFirstAuthor, firstNamePos, 
                        firstNamePos + attributes(firstNamePos)$match.length)
    firstName <- gsub("<ForeName|<\\/ForeName>|([[:space:]]$)", "", firstName)
    firstName <- gsub("^.*>", "", firstName)
    #firstName
    
    tmpName <- paste(firstName, lastName, sep = " ")
    
    #return
    c(tmpPMID, tmpName, tmpTitle)
  }))))
  
  rownames(tmpDF) <- NULL
  tmpDF
  
})))

colnames(myResult) <- c("PMID", "Author", "Title")
head(myResult)
write.csv(myResult, "p53_papers_Chicago.csv")
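
The same three fields can also be pulled out with XPath queries instead of regular expressions. The sketch below is a possible alternative, not the method used in the script above. It runs on made-up records, and first_value() is a hypothetical helper. It assumes the usual PubMed XML layout, in which PMID sits under MedlineCitation and each author has LastName and ForeName children.

```r
library(XML)

# Made-up records; the second one has no author list
toy_xml <- "<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID Version=\"1\">11111111</PMID>
    <Article>
      <ArticleTitle>A toy title about p53.</ArticleTitle>
      <AuthorList>
        <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        <Author><LastName>Roe</LastName><ForeName>Rick</ForeName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID Version=\"1\">22222222</PMID>
    <Article><ArticleTitle>Another toy title.</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"

toy_doc <- xmlParse(toy_xml, asText = TRUE)
toy_articles <- getNodeSet(toy_doc, "//PubmedArticle")

# first_value() is a hypothetical helper: first match of a relative XPath, or NA
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

toy_rows <- lapply(toy_articles, function(a) {
  first <- first_value(a, ".//AuthorList/Author[1]/ForeName")
  last  <- first_value(a, ".//AuthorList/Author[1]/LastName")
  # keep NA when there is no last name; otherwise join the available parts
  author <- if (is.na(last)) NA_character_ else
    trimws(paste(if (is.na(first)) "" else first, last))
  data.frame(PMID   = first_value(a, "./MedlineCitation/PMID"),
             Author = author,
             Title  = first_value(a, ".//ArticleTitle"),
             stringsAsFactors = FALSE)
})
toy_result <- do.call(rbind, toy_rows)
print(toy_result)
```

In this sketch, a record that lacks an author list yields NA in the Author column rather than an error.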

About Author

Damiano
Postdoc Research Fellow at Northwestern University (Chicago)

10 Comments

  1. Tyler

    Damiano,

    I am very new to R and extremely impressed by your easyPubMed tool. Is there a way that I could search several thousand names instead of just one name at a time?

    Thank you for your help.

    1. Damiano (Post author)

      Hi Tyler, you can query for multiple authors at the same time. However, you cannot include thousands of authors in the same query. The query is sent via a GET request, which means that it is passed to the PubMed server as part of the URL. In my hands, you can query about 100 names at the same time. Below, you can find an example.


      library(easyPubMed)
      my_query <- "Immune AND Chicago AND 2017[PDAT]"
      #
      # Query pubmed and fetch the results
      my_query <- get_pubmed_ids(my_query)
      my_abstracts_xml <- fetch_pubmed_data(my_query, retmax = 1000)
      my_abstracts_list <- articles_to_list(my_abstracts_xml)
      #
      # Process each PubMed record to extract names and last names
      # Use do.call() to combine everything in a data.frame
      # Note that this will take few minutes (3-6m).
      my_auth_list <- lapply(my_abstracts_list, article_to_df, autofill = F, max_chars = 0)
      my_auth_list <- do.call(rbind, my_auth_list)
      #
      # An excerpt of what you got
      my_auth_list[1:10, c("pmid", "lastname", "firstname")]
      #
      # Now, let's recursively query all these authors at once. We concatenate
      # (Last_name First_Name[AU] AND 2017[PDAT])
      # AU is a filter for author, PDAT for published date
      #
      allAuths <- list()
      for (i in 1:nrow(my_auth_list)) {
        getNM <- gsub("^[[:space:]]+|[[:space:]]+$", "", my_auth_list[i, c("lastname", "firstname")])
        if (sum(nchar(getNM)<1) == 0){
          fullNM <- paste(paste(getNM, collapse = " "), "[AU]", sep = "", collapse = "")
          allAuths[[fullNM]] <- 1
        }
      }
      #
      # We are attempting 5382 authors at once. Well, it won't work...
      length(allAuths)
      #
      # Finalize mega query. How many authors can I query at once?
      #
      # 10: yes!
      megaQuery <- paste(names(allAuths)[1:10], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job01 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0010")
      #
      # 50: yes!
      megaQuery <- paste(names(allAuths)[1:50], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job02 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0050")
      #
      # 150: maybe...
      megaQuery <- paste(names(allAuths)[1:150], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job03 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0150")
      #
      # Don't attempt more authors at the same time, it won't work...
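
One possible way to stay under the URL length limit with a long author list is to split the names into groups and build one query per group. The sketch below is an assumption about how this could be done, not a feature of easyPubMed. chunk_author_queries() is a hypothetical helper, the author names are made up, and the chunk size of 100 simply reflects the "about 100 names" figure mentioned above.

```r
# chunk_author_queries() is a hypothetical helper, NOT part of easyPubMed
chunk_author_queries <- function(author_terms, chunk_size = 100,
                                 suffix = " AND 2017[PDAT]") {
  # assign each term to a group of at most chunk_size terms
  groups <- split(author_terms, ceiling(seq_along(author_terms) / chunk_size))
  # build one "(A OR B OR ...) AND 2017[PDAT]" query per group
  unname(vapply(groups, function(g) {
    paste0("(", paste(g, collapse = " OR "), ")", suffix)
  }, character(1)))
}

# Made-up author terms in the "Last_name First_name[AU]" form used above
toy_authors <- paste0("Author", 1:250, " A[AU]")
toy_queries <- chunk_author_queries(toy_authors, chunk_size = 100)
length(toy_queries)  # three queries: 100 + 100 + 50 authors
```

Each element of toy_queries could then be passed to batch_pubmed_download() with its own dest_file_prefix, in the same way as the megaQuery strings in the example.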

      Hope this helps. Thank you!

  2. Athul

    Tried to run the same copy of the real world example. Got some errors.

    Error in fetch_pubmed_data(pubmedIdList = myIdList, retstart = i, retmax = myRetmax) :
    unused argument (pubmedIdList = myIdList)
    Calls: do.call -> lapply -> FUN -> fetch_pubmed_data
    Execution halted

    Can you please tell me why this is so?

    1. Damiano (Post author)

      Hi Athul,
      thanks for letting me know. I forgot to update this page after I updated the easyPubMed package some time back. The error was raised by the fetch_pubmed_data() function. The first argument should now read pubmed_id_list and not pubmedIdList. I corrected the faulty line in the script above, and the code now reads:

      [...]
      recordsXml <- fetch_pubmed_data(pubmed_id_list = myIdList, retstart = i, retmax = myRetmax)
      [...]

      The code above should now work out-of-the-box (I double checked, it runs smoothly on my system). Again, thanks for pointing this error out!

      1. Athul

        Thank you so much for your quick response. It worked perfectly! This is a great tool.

  3. Severin

    Hi Damiano,
    is it possible to extract the keywords from PubMed XML files? (I am not familiar with working with XML files.)

    Thank you for your response.
    Severin

  4. koushik

    Hi Damiano,

    Thanks for the amazing post.

    Along with the title, PubMed ID and author name, I would also like to get the email ID or address. Could you kindly share the code, please?

    Looking forward to hearing from you.

    thanks

    1. Damiano (Post author)

      This is now part of the official easyPubMed release. Let me know if you need help with this.

  5. Adhip

    Hello Damiano!
    Can you please explain a way to fetch a whole abstract for a particular PMID using easyPubMed?

    1. Damiano (Post author)


      # Load package
      library(easyPubMed, quietly = TRUE)
      #
      # Let's say you have the following PubMed ID, and you want to retrieve the corresponding abstract
      my_query_string <- "30446446[PMID]"
      #
      # First, query Entrez
      my_query <- get_pubmed_ids(my_query_string)
      #
      # Then, fetch the records
      xml_record <- fetch_pubmed_data(my_query)
      #
      # Finally, extract info (including the abstract)
      final_df <- easyPubMed::table_articles_byAuth(pubmed_data = xml_record,   # your input
                                                    max_chars = -1,             # fetch the whole abstract
                                                    included_authors = 'last')  # one author will suffice
      #
      # Now, you have your abstract in the 'abstract' slot
      print(final_df$abstract)
