Querying PubMed via the easyPubMed package in R

PubMed (NCBI Entrez) is an online database of citations for biomedical literature, available at the following URL: http://www.ncbi.nlm.nih.gov/pubmed. Retrieving data from PubMed in an automated way is also possible via the NCBI Entrez E-utilities. A description of how the NCBI E-utilities work is available at the following URL: http://www.ncbi.nlm.nih.gov/books/NBK25501/.
easyPubMed is an R package I wrote that makes it easy to download content from PubMed in XML format. easyPubMed includes 3 functions: get_pubmed_ids(), fetch_pubmed_data() and batch_pubmed_download().


Note! This post discusses the use of easyPubMed version 2.23 or lower. The easyPubMed library has significantly evolved since this content was first made available. While the code below may NOT be compatible with the latest version of easyPubMed, this post is maintained in its original format for legacy purposes. For more info about the latest version of easyPubMed, please see: https://www.data-pulse.com/dev_site/easypubmed/.


Get easyPubMed

You can download the latest version of easyPubMed (dev version) from GitHub. A convenient way to install a package from GitHub is to use remotes. You can also obtain a specific branch/version of interest (here, version 2.23).

remotes::install_github(repo = "dami82/easyPubMed", force = TRUE, build_vignettes = TRUE, ref = "dev_2_23")

get_pubmed_ids()
get_pubmed_ids() takes a character string as its argument. This character string is the text of the PubMed query we want to perform, using the same syntax as a regular PubMed query. The function queries PubMed via the NCBI eSearch utility, and the result of the query is saved on the PubMed History Server. It returns a list that includes info about the query, the PubMed IDs of the first 20 results, a WebEnv string and a QueryKey string. These two strings are required by the fetch_pubmed_data() function, as they are used for accessing the results stored on the PubMed History Server. The returned list also contains a Count value (note that Count is a character) that reports the total number of results produced by the query.

Example. In order to retrieve citations about “p53” published by laboratories located in Chicago, IL in 2019, we can run the following lines of code.

library(easyPubMed)
library(httr)

myQuery <- "p53 AND Chicago[Affiliation] AND 2019[PDAT]" 
myIdList <- get_pubmed_ids(myQuery)

# The query produced the following number of results
as.integer(as.character(myIdList$Count))

# This is the unique WebEnv string
myIdList$WebEnv

# The PubMed ID of the first record produced by this query is the following
myIdList$IdList[[1]]

# Open the PubMed abstract corresponding to the first (most recent) record in a new browser window
httr::BROWSE(paste("http://www.ncbi.nlm.nih.gov/pubmed/", myIdList$IdList[[1]], sep = ""))

fetch_pubmed_data() and table_articles_byAuth()
fetch_pubmed_data() retrieves data from the PubMed History Server in XML format via the eFetch utility. The only required argument of the fetch_pubmed_data() function is a list containing a QueryKey value and a WebEnv value; typically, this is the list resulting from a get_pubmed_ids() call. By default, fetch_pubmed_data() returns a string (a character vector of length 1) including the first 500 records returned by the PubMed query. In order to retrieve a different number of records, we need to specify the following optional arguments:

  • retstart: an integer defining the position of the first record to be retrieved.
  • retmax: an integer defining the total number of records to be fetched.

Even when a large number of records has to be fetched, it is recommended to download records in batches of 500 to 1000 items at a time. The maximum number of records that can be fetched at once is 5000. Note: easyPubMed DOES NOT return results as XMLInternalDocument-class objects anymore. Results are always returned as one or more strings. We can fetch records in XML (text including XML tags), plain TXT, and other PubMed-supported formats. To extract specific XML fields of interest, we can rely on the custom_grep() function. To extract multiple fields at once, we can use table_articles_byAuth(). In the latter case, fields are cast as a data.frame where each row corresponds to one author. We can extract the first author or the last author; alternatively, we can extract all authors from a record. All other fields (DOI, Title, Journal, …) will be recycled for each author.

# fetch PubMed records
topRecords <- fetch_pubmed_data(myIdList)

# class is 'character'; length is '1'
class(topRecords)
length(topRecords)
# Fetch the first 20 PubMed records (note: retstart is 0-based)
top20records <- fetch_pubmed_data(myIdList, retstart = 0, retmax = 20)

# Extract titles
myTitles <- custom_grep(xml_data = top20records, tag = "ArticleTitle", format = "char")
head(myTitles)

# Extract multiple fields from each PubMed record
allFields <- table_articles_byAuth(top20records, included_authors = "last", getKeywords = TRUE)
head(allFields)

In the following real-world example, we are going to fetch all papers about “p53” published by laboratories located in Chicago (2010-2019). For each record, the PubMed ID, the first author’s last name and the journal abbreviation will be extracted, together with the DOI and keywords. Results are saved as a csv file. All this can be accomplished using a few lines of code.

library(easyPubMed)

myQuery <- 'p53 AND Chicago[Affiliation] AND ("2010/01/01"[PDAT] : "2019/12/31"[PDAT])' 
myIdList <- get_pubmed_ids(myQuery)

# Note: retstart is 0-based, and Count is returned as a character
all_steps <- seq(0, as.integer(myIdList$Count) - 1, by = 50)
results <- lapply(all_steps, function(i) {
  y <- fetch_pubmed_data(pubmed_id_list = myIdList, retmax = 50, retstart = i)  
  yy <- table_articles_byAuth(y, included_authors = "first", getKeywords = TRUE)
  yy[, c("pmid", "doi", "jabbrv", "lastname", "keywords")]
})

results <- do.call(rbind, results)
nrow(results)
head(results)

# Save the results as a csv file (the file name is arbitrary)
write.csv(results, file = "p53_chicago_results.csv", row.names = FALSE)

Download and extract fields
We can also save PubMed data locally before extracting fields from each record. An example is included below.

myQuery <- 'p53 AND Chicago[Affiliation] AND ("2010/01/01"[PDAT] : "2019/12/31"[PDAT])' 
fdt_files <- batch_pubmed_download(pubmed_query_string = myQuery,
                                   format = "xml",
                                   batch_size = 50,
                                   dest_file_prefix = "fdt",
                                   encoding = "UTF-8")

# File names
head(fdt_files)

# Read files, extract fields, and then cast as data.frames
fdt_list <- lapply(fdt_files, table_articles_byAuth, 
                   included_authors = "last", getKeywords = TRUE) 
class(fdt_list)
sapply(fdt_list, class)

# Aggregate
results <- do.call(rbind, fdt_list)
head(results)

These are some simple examples to help you get started with easyPubMed. Don’t hesitate to post comments or email me at damiano DOT fantini AT gmail DOT com with questions, concerns, and suggestions. Thanks.

About Author

Damiano
Postdoc Research Fellow at Northwestern University (Chicago)

36 Comments

  1. Tyler

    Damiano,

    I am very new to R and extremely impressed by your easyPubMed tool. Is there a way that I could search several thousand names instead of just one name at a time?

    Thank you for your help.

    1. Damiano (Post author)

      Hi Tyler, you can query for multiple authors at the same time. However, you cannot query a thousand authors in the same query. The query is passed via a GET method, which means that it is passed to the PubMed server as part of the URL. In my hands, you can query about 100 names at the same time. Below, you can find an example.


      library(easyPubMed)
      my_query <- "Immune AND Chicago AND 2017[PDAT]"
      #
      # Query PubMed and fetch the results
      my_query <- get_pubmed_ids(my_query)
      my_abstracts_xml <- fetch_pubmed_data(my_query, retmax = 1000)
      my_abstracts_list <- articles_to_list(my_abstracts_xml)
      #
      # Process each PubMed record to extract names and last names.
      # Use do.call() to combine everything in a data.frame.
      # Note that this will take a few minutes (3-6m).
      my_auth_list <- lapply(my_abstracts_list, article_to_df, autofill = FALSE, max_chars = 0)
      my_auth_list <- do.call(rbind, my_auth_list)
      #
      # An excerpt of what you got
      my_auth_list[1:10, c("pmid", "lastname", "firstname")]
      #
      # Now, let's recursively query all these authors at once. We concatenate
      # (Last_name First_Name[AU] AND 2017[PDAT]).
      # AU is a filter for author, PDAT for publication date.
      allAuths <- list()
      for (i in 1:nrow(my_auth_list)) {
        getNM <- gsub("^[[:space:]]+|[[:space:]]+$", "", my_auth_list[i, c("lastname", "firstname")])
        if (sum(nchar(getNM) < 1) == 0) {
          fullNM <- paste(paste(getNM, collapse = " "), "[AU]", sep = "", collapse = "")
          allAuths[[fullNM]] <- 1
        }
      }
      #
      # We are attempting 5382 authors at once. Well, it won't work...
      length(allAuths)
      #
      # Finalize the mega query. How many authors can I query at once?
      #
      # 10: yes!
      megaQuery <- paste(names(allAuths)[1:10], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job01 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0010")
      #
      # 50: yes!
      megaQuery <- paste(names(allAuths)[1:50], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job02 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0050")
      #
      # 150: maybe...
      megaQuery <- paste(names(allAuths)[1:150], collapse = " OR ")
      megaQuery <- paste("(", megaQuery, ")", " AND 2017[PDAT]", sep = "")
      job03 <- easyPubMed::batch_pubmed_download(megaQuery, dest_file_prefix = "job_0150")
      #
      # Don't attempt more authors at the same time, it won't work...

      Hope this helps. Thank you!

  2. Athul

    Tried to run the same copy of the real world example. Got some errors.

    Error in fetch_pubmed_data(pubmedIdList = myIdList, retstart = i, retmax = myRetmax) :
    unused argument (pubmedIdList = myIdList)
    Calls: do.call -> lapply -> FUN -> fetch_pubmed_data
    Execution halted

    Can you please tell me why this happens?

    1. Damiano (Post author)

      Hi Athul,
      thanks for letting me know. I forgot to update this page after I updated the easyPubMed package some time back. The error was raised by the fetch_pubmed_data() function. The first argument should now read pubmed_id_list and not pubmedIdList. I updated the wrong line in the script above, and the correct code is now:

      [...]
      recordsXml <- fetch_pubmed_data(pubmed_id_list = myIdList, retstart = i, retmax = myRetmax)
      [...]

      The code above should now work out-of-the-box (I double checked, it runs smoothly on my system). Again, thanks for pointing this error out!

      1. Athul

        Thank you so much for your quick response. It worked perfectly! This is a great tool.

  3. Severin

    Hi Damiano,
    is it possible to extract the Keywords from PubMed XML files? (I am not familiar with working with XML files.)

    Thank you for your response.
    Severin

  4. koushik

    Hi Damiano,

    Thanks for the amazing post.

    Along with the title, PubMed ID and author name, I would also like to get the email ID or address. Could you kindly share the code, please?

    Looking forward to hearing from you.

    thanks

    1. Damiano (Post author)

      This is now part of the official easyPubMed release. Let me know if you need help with this.
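
      For records retrieved as in the examples above, something along these lines should do it (a minimal sketch; affiliations and emails land in the address and email columns of the output data.frame, with NA where missing):

      library(easyPubMed)
      my_query <- get_pubmed_ids("p53 AND Chicago[Affiliation] AND 2019[PDAT]")
      my_xml <- fetch_pubmed_data(my_query, retmax = 20)
      # One row per author; affiliation and email are regular columns
      my_df <- table_articles_byAuth(pubmed_data = my_xml, included_authors = "all")
      head(my_df[, c("pmid", "lastname", "address", "email")])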

      1. koushik

        Thanks, Damiano. I just got a chance to review the code. Awesome, it is working great.

  5. Adhip

    Hello Damiano!
    Can you please explain a way to fetch the whole abstract for a particular PMID using easyPubMed?

    1. Damiano (Post author)


      # Load package
      library(easyPubMed, quietly = TRUE)
      #
      # Let's say you have the following PubMed ID, and you want to retrieve the corresponding abstract
      my_query_string <- "30446446[PMID]"
      #
      # First, query Entrez
      my_query <- get_pubmed_ids(my_query_string)
      #
      # Then, fetch the records
      xml_record <- fetch_pubmed_data(my_query)
      #
      # Finally, extract info (including the abstract)
      final_df <- easyPubMed::table_articles_byAuth(pubmed_data = xml_record,  # your input
                                                    max_chars = -1,            # fetch the whole abstract
                                                    included_authors = 'last') # one author will suffice
      #
      # Now, you have your abstract in the 'abstract' slot
      print(final_df$abstract)

  6. Raoul

    Hi Damiano,
    I tried to run your code but it doesn’t work. The error message is:
    Error in UseMethod("xpathApply") :
    no applicable method for 'xpathApply' applied to an object of class "character"

    Thanks for your help!

    Greetings, Raoul

    1. Damiano (Post author)

      Hi! Sorry for the late reply… Since the last update, I implemented a custom approach to extract XML-tagged fields, so easyPubMed does not rely on the XML library anymore. In turn, I need to fix the code in the example, since -as you pointed out- it is outdated and now throws an error. I will fix this in the next few days. In the meantime, send me a message if you need help getting started with easyPubMed. Again, thanks for the post.

  7. Li

    Hi Damiano,

    Is it possible to get sample size reports? If not directly, maybe by specifying an inclusion of sentences in the abstracts that contain “sample size”, “study size” and “n=”?

    Thanks!

    1. Damiano (Post author)

      Sure, there are a couple of ways to do that; we can test and see which one is the most efficient.
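
      For instance, a rough sketch (this is plain base-R pattern matching on the fetched abstracts, not a built-in easyPubMed feature; the query string is just a placeholder):

      library(easyPubMed)
      my_query <- get_pubmed_ids("p53 AND Chicago[Affiliation] AND 2019[PDAT]")
      my_xml <- fetch_pubmed_data(my_query, retmax = 50)
      my_df <- table_articles_byAuth(pubmed_data = my_xml,
                                     included_authors = "last",
                                     max_chars = -1)  # keep the whole abstract
      # Flag records whose abstract mentions a sample-size phrase
      has_size <- grepl("sample size|study size|n[[:space:]]*=",
                        my_df$abstract, ignore.case = TRUE)
      my_df[has_size, c("pmid", "title")]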

  8. Bhabatosh

    I have a list of PubMed IDs and DOIs, and I want to extract author affiliations and email IDs batch-wise. How can I do that?

    1. Damiano (Post author)

      The upcoming version of easyPubMed will include a function designed to fetch records given a list of PMIDs… I’ll keep you posted.

  9. mojgan

    Hi Damiano,
    I have a list of PMIDs. How can I extract the DOI and affiliation for each record with easyPubMed?

    1. Damiano (Post author)

      Sure. There is a function for this in the newest version of the package, available on GitHub.

      # Install via devtools
      library(devtools)
      install_github("dami82/easyPubMed", force = TRUE, build_opts = NULL)

      Here, I assume that `my_pmids` is a character vector of PubMed identifiers, such as `"30197875" "30190923" "30190919" "29793310" "29633980" "29599908"`. You should use `fetch_PMID_data()` to retrieve the records, and then you can extract fields using `table_articles_byAuth()` (as usual). The output includes all the fields you are interested in.

      # Get ID data
      all.DATA <- fetch_PMID_data(pmids = my_pmids)

      # Extract data, make a data.frame
      all.df <- table_articles_byAuth(pubmed_data = all.DATA,
                                      included_authors = "last",
                                      max_chars = 100)
      all.df[1:5, c(1, 5, 6, 7, 8)]

      Send me an email if you have other questions (or provide me with a valid email address in the form, so I can reach out to you).

  10. Stephanie

    Hi Damiano,

    Is it possible to get the MeSH terms in a column? getKeywords=T only gets the keywords, not MeSH.

    Thanks,
    Stephanie

    1. Damiano (Post author)

      In principle, this may be doable. The problem is that I do not see how/where MeSH terms are stored in a PubMed record; specifically, I do not see MeSH fields in sample PubMed records. My library makes it easy to extract XML fields from PubMed records, so if MeSH terms are not identified by a specific XML tag, it may be difficult to extract them. If you can provide me with some guidance, I recommend bringing this offline… Send me an email if you are interested.

  11. Eric

    Hi Damiano,

    Love your package.
    I am trying to create a dataframe but ultimately I am getting the following error:

    An error occurred
    Error in tryCatch({ : object ‘final.mat’ not found

    here is my code

    fdt_xml <- batch_pubmed_download(pubmed_query_string = fdt_query,
                                     format = "xml",
                                     batch_size = 50,
                                     dest_file_prefix = "fdt",
                                     encoding = "ASCII")

    fdt_list <- lapply(fdt_xml,
                       articles_to_list) # generate list of lists

    fdt_list <- flatten(fdt_list) # convert list of lists to a single list (Do I need to do this step??)

    fdt_df <- lapply(fdt_list,
                     article_to_df,
                     autofill = FALSE,
                     max_chars = 1, # don't keep any of the abstract text
                     getKeywords = TRUE) # just keep the keywords

    … and I get the error.

    Can you help explain why?

    If I select just one item in the list, it is not a problem to run the following code:

    fdt_df <- lapply(fdt_list[1],
                     article_to_df,
                     autofill = FALSE,
                     max_chars = 1, # don't keep any of the abstract text
                     getKeywords = TRUE) # just keep the keywords

    Any help/advice much appreciated…

    Thanks!

    1. Damiano (Post author)

      Hi Eric. Thanks for using easyPubMed.
      The error: your code generates a list of character vectors (not a list of lists). In the following step, you are trying to apply article_to_df() over each element of that list. This means that article_to_df() receives a character vector of length 50 as its argument. Unfortunately (as you noted above), article_to_df() can only handle one string (a character vector of length 1).
      Solution: the example above has been updated to show how to handle data downloaded using batch_pubmed_download(). Briefly, you should just use table_articles_byAuth(), as sketched below.
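
      In code, the fix looks something like this (a sketch mirroring the updated example above; fdt_xml is the vector of file names returned by your batch_pubmed_download() call):

      # Read each downloaded XML file and cast it to a data.frame directly
      fdt_list <- lapply(fdt_xml, table_articles_byAuth,
                         max_chars = 1, getKeywords = TRUE)
      fdt_df <- do.call(rbind, fdt_list)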
      I hope this may be helpful.
      Best regards.

  12. Eric Bergh

    Damiano,

    Apologies for the later reply, was traveling abroad.
    Using table_articles_byAuth() solved the problem!
    Works amazing, love the package, thanks so much.
    -Eric

  13. Chunhui Cai

    Dear Damiano,

    Your package is great! Can I use it to bulk download all papers with title and abstract from PubMed, or at least those from the past 15 years? Your advice will be highly appreciated!

    Best,
    Chunhui

  14. Martin LARSEN

    Thank you very much for this brilliant package. I have one small question. When I use the fetch_pubmed_data function, it doesn’t take into account the retstart parameter. Indeed, the XML data obtained always start from 0 and stop at retmax (this parameter works). I was setting up a loop to fetch repetitively from the same entrez_id list object. Any idea why this is? Is there a way to extract only parts of the entrez_id obtained with the function get_pubmed_ids? I don’t seem to get my head around the structure of the entrez_id. The count is correct, but I don’t find a place where all the PMIDs are stored.

    1. Damiano (Post author)

      Hi Martin,
      thanks for using easyPubMed. If you want to download a large number of records, you may want to use the batch_pubmed_download() function. Alternatively, you can play around with the retmax/retstart arguments within a loop. I just put together a vignette about how to use retmax and retstart. It is available at this URL: https://www.data-pulse.com/projects/Rlibs/vignettes/easyPubMed_03_retmax_example.html. Let me know if this addresses your questions.

  15. Ferran

    Dear Damiano, congratulations on this amazing package, it has been extremely useful.

    Is there an easy way to download the XML files given a long list of PMIDs (n=120K)?

    I’ve seen that batch_pubmed_download() can download a long list of PMIDs in batches, writing multiple XML files. However, this function starts with a query term, and I did not see an option to input a list of PMIDs instead of a search string. I tried with “123[PMID] OR 1234[PMID] OR 12345[PMID]..” but PubMed returns an error if this includes more than 5K PMIDs.

    Thanks in advance

    1. Damiano (Post author)

      You are raising a valid point here. I think I can easily implement this. I’ll work on this and keep you posted.
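
      In the meantime, a possible workaround is to split the PMIDs into small chunks and fetch each chunk separately, e.g. with the fetch_PMID_data() function from the GitHub version mentioned above (a sketch; my_pmids is assumed to be your character vector of PMIDs):

      # Fetch records for a long PMID vector in chunks of 100
      chunk_size <- 100
      chunks <- split(my_pmids, ceiling(seq_along(my_pmids) / chunk_size))
      all_dfs <- lapply(chunks, function(pm) {
        xml_data <- fetch_PMID_data(pmids = pm)
        table_articles_byAuth(pubmed_data = xml_data, included_authors = "last")
      })
      results <- do.call(rbind, all_dfs)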

  16. Liam

    Hi Damiano,
    I’ve been exploring your package and it is wonderful. Thank you so much.

    I was just wondering, can your package extract the citation count of a paper? I.e., the total number of times a paper has been cited by other papers.

    Regards,
    Liam

    1. Damiano (Post author)

      Hello Liam,
      Thanks for using easyPubMed. Currently, the package does not support extraction of citation counts. Unfortunately, citation info is not included in legacy PubMed records. This may change in the upcoming months if/when a new API is released by NCBI.
      Best regards.

      1. Tom Lahey

        FYI, it might be a long time still for a new API (see the latest NCBI note, quoted below):

        As part of the ongoing release of the new PubMed interface, today we are beginning the process of removing the legacy Web interface for PubMed from the NCBI website. Despite the retirement of the legacy Web interface, the E-utilities will remain unchanged and will continue to support PubMed as they have done. For API users, no action is needed. We are in the early planning stages of a new RESTful API for PubMed, and will publish progress updates as they occur. We would expect any new API to function alongside the E-utility interface to PubMed for an extended time before announcing any changes to the E-utilities, and our policy is to notify the public at least six months in advance of any such changes.

  17. Andy

    Hi Damiano,

    Thanks for putting this together and responding to queries – this is super helpful. I’m looking to search PubMed and pull out a pretty standard set of fields (title, abstract, first author, year, journal…), but I also wondered if you can extract a list of genes from each paper?

    The idea is that we could maybe look at the frequencies of different genes appearing in the results list, and then decide to look at papers identifying gene X to see what the implications are.

    Is this something which could be extracted from keywords/pubmed records? If you have any thoughts I’d be very grateful!

    1. Damiano (Post author)

      Hi Andy,
      easyPubMed retrieves PubMed records only. Therefore, only genes mentioned in the title or abstract can be extracted.
      An issue with this analysis would be the lack of standardization of gene names: some articles report official symbols, others may use full names or aliases.
      The analysis is doable, but you cannot expect to obtain exact results. This being said, it could be interesting to give this a try and run searches at regular intervals; a rough sketch follows.
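
      Something like this (the query, retmax and the my_genes vector below are placeholders to replace with your own):

      library(easyPubMed)
      my_query <- get_pubmed_ids("bladder cancer AND 2018[PDAT]")  # placeholder query
      my_xml <- fetch_pubmed_data(my_query, retmax = 100)
      my_df <- table_articles_byAuth(pubmed_data = my_xml,
                                     included_authors = "last",
                                     max_chars = -1)  # keep whole abstracts
      # Count how often each symbol occurs in titles and abstracts
      my_genes <- c("TP53", "FGFR3", "RB1", "PIK3CA")  # placeholder gene list
      txt <- paste(my_df$title, my_df$abstract)
      gene_counts <- sapply(my_genes, function(g) {
        sum(grepl(paste0("\\b", g, "\\b"), txt))
      })
      sort(gene_counts, decreasing = TRUE)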
      Best regards,
      Damiano

  18. Theresa

    Hi, I’m a student learning about extracting data from PubMed and I came across your guidance. I tried your real-world example, but it is taking a long time to run. I get “Processing PubMed data….done!” but the process never completes, and I never get the results. How long should it take to run?

    1. Damiano (Post author)

      Hi Theresa.
      Difficult to troubleshoot without more info about the exact code you used and your system set-up. As a general rule, make sure to use an Ethernet connection and a computer with enough memory. Also, try to use a Unix or Linux system if possible (note: you can always run the analysis in the cloud, or use Docker or a VM).
      Feel free to email me (check the package manual/description) if you need more help.
      Best regards.

