In my previous posts, I wrote a few words on how to use Selenium and how to capture screenshots of a website. In this one, I want to show how I used this knowledge to save the contents of a report to a Word file. The report is normally displayed on the website, but my client also wanted to be able to download its contents as a Word document.
What will we need?
There are several pieces we will use to get the result. We will need Selenium working with Firefox (I use Firefox because its web driver does not change as frequently as the one for Chrome), the web page with the report, and the python-docx library.
In a few words, the engine will work as follows:
- Selenium loads the page
- Once the page is loaded, the screenshots of particular pieces are taken
- The data from particular places in the report are retrieved
- Screenshots and data are placed in a Word document
- The Word document is saved to a file
Let’s take a look at the whole code and discuss it bit by bit:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from docx import Document
from docx.shared import Inches, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH


def start_word_document(driver):
    # Use the text of the 'masthead' element as the report title
    report_title = driver.find_element(By.ID, 'masthead').text
    document = Document()
    p = document.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    title = p.add_run(report_title)
    title.bold = True
    title.font.size = Pt(16)
    return document


def perform_images_export(driver, document):
    # Take a screenshot of every post and append it to the document
    posts = driver.find_elements(By.CSS_SELECTOR, '.post-content-wrapper')
    for post in posts:
        post.screenshot('post.png')
        document.add_picture('post.png', width=Inches(3))


def main():
    options = Options()
    # options.headless = True
    driver = webdriver.Firefox(options=options)
    driver.get('https://handyman.dulare.com/')
    document = start_word_document(driver)
    perform_images_export(driver, document)
    driver.quit()
    document.save('my_report.docx')


if __name__ == '__main__':
    main()
The main function opens the Firefox browser, loads the page, and performs two actions: it creates the Word document and exports the images. Once that is done, the browser is closed and the document is saved to disk.
If you wonder why “options.headless = True” is commented out: you can uncomment it if you don’t want to watch the browser while the script runs. For demo purposes, it is better to see what happens in the browser. Scripts that run in the background have no need to show the browser, and in such cases I use headless mode.
The “start_word_document” function takes care of creating the Word document. It also retrieves the page title from the “masthead” element and places it in the first paragraph of the document, centered, in bold, and set to 16 points. You may find the same approach handy for creating in-document headings or dividers.
Once the document is created and filled with some text, the “perform_images_export” function finds all elements with the class “post-content-wrapper”. We then iterate through these elements and take a screenshot of each of them. Each screenshot is saved to “post.png” and added to the document with a width of 3 inches.
The real report has more pieces (more text fragments and more images to put in the document), but this code should be a useful starting point for your own experiments.