    Twitter Digest

    • RT @mcnostrilcom: Alright folks, here's the 2020 hourly comics as one single image. Dunno how twitter is going to handle that.
      You can als… 2020-02-02
    • RT @Jantafrench: Interesting.
      Elk Island Public Schools fought to keep the word "public" in its legal name, and succeeded.
      #ableg #abed htt… 2020-02-05
    • RT @lavermeer: #Reading today: Andrea Warner, BUFFY SAINTE-MARIE: THE AUTHORIZED BIOGRAPHY. Sorry to be late getting to this. A musical and… 2020-02-06
    • Anyone who thinks that architecture doesn't have the power to glorify power and suppress the masses has never stepp… https://t.co/BaYK3w8F14 2020-02-08

    Web scraping Python code

    In my previous post I explained that I was looking for a way to use web scraping to extract data from my Calibre-Web shelves and automatically post them to my Books Read page here on my site. In this post I will step through my final Python script to explain to my future self what I did and why.

    Warning: Security

    A heads up: I make no guarantees that this code is secure enough to use in a production environment. In fact I would guess it isn’t. But my Calibre-Web webserver is local to my home network and I trust that my hosted server () is secure enough. Since you are passing passwords etc. back and forth, I wouldn’t count on any of this being secure without a lot more effort than I am willing to put in.
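
    If you do want to harden this a little, one easy step is to keep the username and password out of the script itself. A minimal sketch, assuming you set two environment variables first (CALIBRE_USER and CALIBRE_PASS are hypothetical names, not something my script actually uses):

    import os

    # read credentials from the environment instead of hard-coding them
    login_data = {
        'next': '/',
        'username': os.environ.get('CALIBRE_USER', ''),
        'password': os.environ.get('CALIBRE_PASS', ''),
        'remember_me': 'on',
    }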

    The code in bits

    # import various libraries

    import requests
    from bs4 import BeautifulSoup
    import re

    This loads the various libraries the script uses. Requests is an HTTP library that lets you send requests to websites, BeautifulSoup is a library for pulling data out of HTML, and re is a regex library that lets you do custom searches.

    # set variables

    # set header to avoid being labeled a bot
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    
    # set base url
    urlpath='http://urlpath'
    
    # website login data
    login_data = {
        'next': '/',
        'username': 'username',
        'password': 'password',
        'remember_me': 'on',
    }
    
    # set path to export as markdown file 
    path_folder="/Volumes/www/books/"
    file = open(path_folder+"filename.md","w")

    This sets up the various variables used for login, including a header to try and avoid being labeled a bot, the base url of the Calibre-Web installation and the login data, and it specifies a location and name for the resulting markdown file. The open command is called with the ‘w’ switch to indicate that the script will write a new file every time it is executed, overwriting the old one.
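
    If I ever wanted the script to add to the existing file instead of starting over each time, the only change would be the mode flag (a quick sketch, not part of my script):

    # "a" appends to the existing file instead of overwriting it
    file = open(path_folder+"filename.md","a")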

    # log in and open http session

    with requests.Session() as sess:
        url = urlpath+'/login'
        res = sess.get(url, headers=headers)
        res = sess.post(url, data=login_data)

    Then, using Requests, the script opens a session on the webserver and logs in using those variables.

    Writing the File

    Note: The code has matching file.write() and print() statements throughout. The print() statements just write to the terminal app and allow me to see what is being written to the actual file using file.write(). They are completely unnecessary.

    # Set Title

    file.write("# Books Read\n")
    print("# Books Read\n")

    Pretty basic: write the words Books Read followed by a carriage return, tagged with a # to indicate it is an h1 heading. This will become the actual page name.

    # find list of shelves

    shelfhtml = sess.get(urlpath)
    soup = BeautifulSoup(shelfhtml.text, "html.parser")
    shelflist = soup.find_all('a', href=re.compile('/shelf/[1-9]'))
    print (shelflist)

    So now we set the variable shelfhtml to the session we opened earlier. Using BeautifulSoup we grab all the html code and search for all a links that have an href matching the regex expression ‘/shelf/[1-9]’. (Hopefully I won’t have more than 9 shelves or I will have to redo this bit.) The variable now contains a list of all the links that match that pattern and looks like this:

    [<a ><span class="glyphicon glyphicon-list private_shelf"></span>2018</a>, <a ><span class="glyphicon glyphicon-list private_shelf"></span>2019</a>, <a ><span class="glyphicon glyphicon-list private_shelf"></span>2020</a>]

    This, as you can see, contains the links to all three of my current Year shelves, displayed in ascending numerical order.
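
    If I ever do end up with more than 9 shelves, the fix is a slightly more general pattern; a quick sketch (not something I have needed yet):

    # match any shelf number, not just 1–9
    shelflist = soup.find_all('a', href=re.compile(r'/shelf/\d+'))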
     

    #reverse order of urllist

    dateshelflist=(get_newshelflist())
    dateshelflist.reverse()
    print (dateshelflist)

    I wanted to display my book lists from newest to oldest so I used python to reverse the items in the list.
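
    One note for future me: get_newshelflist() is never defined in this post. I assume it simply hands back the shelf links found above; a minimal stand-in under that assumption would be:

    # hypothetical stand-in: just return the shelf links found earlier
    def get_newshelflist():
        return list(shelflist)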

    First loop: the shelves

    The first loop loops through all the shelves (in this case 3 of them) and starts the process of building a book list for each.

    # loop through sorted shelves

    for shelf in dateshelflist:
        #set shelf page url
        res = sess.get(urlpath+shelf.get('href'))
        soup = BeautifulSoup(res.text, "html.parser")
    
        # find year from shelflist and format
        shelfyear = soup.find('h2')
        year = re.search("([0-9]{4})", shelfyear.text)
        year.group()
        file.write("### {}\n".format(year.group()))
        print("### {}\n".format(year.group()))

    In the first iteration of the loop, the script goes to the actual shelf page by adding an href extracted from the list (using a get command) to the base url, and then parses the html from the resulting webpage. Then the script finds the year info, which is an H2, extracts the 4-digit year with the regex ([0-9]{4}) and writes it to the file, formatted as an H3 header and followed by a line break.

    # find all books

    books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')

    Using BeautifulSoup we extract the list of books from the page knowing they are all marked with a div in the class col-sm-3 col-lg-2 col-xs-6 book.

    Second loop: the books

    #loop though books. Each book is a new BeautifulSoup object.

    for book in books:
            title = book.find('p', class_='title')
            author = book.find('a', class_='author-name')
            seriesname = book.find('p', class_='series')
            pubdate = book.find('p', class_='publishing-date')
            coverlink = book.find('div', class_='cover')
            if None in (title, author, coverlink, seriesname, pubdate):
                continue
            # extract year from pubdate
            pubyear = re.search("([0-9]{4})", pubdate.text)
            pubyear.group()

    This is the beginning of the second loop. For each book we use soup to extract the title, author, series, pubdate and cover (which I don’t end up using). Each search is based on the class assigned to it in the original html code. Because I only want the pub year and not the full pub date, I again use a regex to extract the 4-digit year. The if None… statement is there just in case one of the fields is empty; it skips that book so the script doesn’t crash.

    # construct line using markdown

    newstring = "* ***{}*** — {} ({})\\\n{} – ebook\n".format(title.text, author.text, pubyear.group(), seriesname.text)
    file.write(newstring)
    print (newstring)

    Next we construct the book entry based on how we want it to appear on the web page. In my case I want each entry to be a list item and end up looking like this:

    • The Cloud Roads — Martha Wells (2011)
      Book 1.0 of Raksura – ebook

    Python allows you to just list the variables at the end of the statement in format() and it fills in the {} placeholders in order, which makes formatting easier. The script then writes the line to the open markdown file and heads back up to the beginning of the loop to grab the next book.
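
    As a stripped-down illustration of that (using the book from the example above), format() just fills the braces in order:

    line = "* ***{}*** — {} ({})".format("The Cloud Roads", "Martha Wells", 2011)
    # line is now: * ***The Cloud Roads*** — Martha Wells (2011)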

    More loops

    That’s pretty much it. It loops through the books until it runs out and heads back to the first loop to see if there is another shelf to process. After it processes all the shelves it drops to the last line of the script:

    file.close()

    which closes the file and that is that—c’est tout. The file will be read the next time someone visits the Books Read page on my site.

    In Conclusion

    Hopefully this is clear enough so that when I forget every scrap of python in the years to come I can still recreate this after the inevitable big crash. The script, called scrape.py in my case, is executed in terminal by going to the enclosing folder, typing python3 scrape.py and hitting enter. Automating that is something I will ponder if this book list thing becomes my ultimate methodology for recording books read. Its big failing is that it only records ebooks in my Calibre library. I might have to redo the entire thing for something like LibraryThing where I can record all my books…lol. Hmmm… maybe…

    The Final Code

    Here is the final script in its entirety.

    # import various libraries
    import requests
    from bs4 import BeautifulSoup
    import re
    
    # set header to avoid being labeled a bot
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    
    # set base url
    urlpath='http://urlpath'
    
    # website login data
    login_data = {
        'next': '/',
        'username': 'username',
        'password': 'password',
        'remember_me': 'on',
    }
    
    # set path to export as markdown file
    path_folder="/Volumes/www/home/books/"
    file = open(path_folder+"filename.md","w")
    
    with requests.Session() as sess:
        url = urlpath+'/login'
        res = sess.get(url, headers=headers)
        res = sess.post(url, data=login_data)
    
    # Note: print() commands are purely for terminal output and unnecessary
    
    # Set Title
    file.write("# Books Read\n")
    print("# Books Read\n")
    
    # find list of shelves
    shelfhtml = sess.get(urlpath)
    soup = BeautifulSoup(shelfhtml.text, "html.parser")
    shelflist = soup.find_all('a', href=re.compile('/shelf/[1-9]'))
    # print (shelflist)
    
    #reverse order of urllist
    dateshelflist=(get_newshelflist())
    dateshelflist.reverse()
    # print (dateshelflist)
    
    # loop through sorted shelves
    for shelf in dateshelflist:
    
        #set shelf page url
        res = sess.get(urlpath+shelf.get('href'))
        soup = BeautifulSoup(res.text, "html.parser")
    
        # find year and format
        shelfyear = soup.find('h2')
        year = re.search("([0-9]{4})", shelfyear.text)
        year.group()
        file.write("### {}\n".format(year.group()))
        print("### {}\n".format(year.group()))
    
        # find all books
        books = soup.find_all('div', class_='col-sm-3 col-lg-2 col-xs-6 book')
    
        #loop though books. Each book is a new BeautifulSoup object.
        for book in books:
            title = book.find('p', class_='title')
            author = book.find('a', class_='author-name')
            seriesname = book.find('p', class_='series')
            pubdate = book.find('p', class_='publishing-date')
            coverlink = book.find('div', class_='cover')
            if None in (title, author, coverlink, seriesname, pubdate):
                continue
            # extract year from pubdate
            pubyear = re.search("([0-9]{4})", pubdate.text)
            pubyear.group()
            # construct line using markdown
            newstring = "* ***{}*** — {} ({})\\\n{} – ebook\n".format(title.text, author.text, pubyear.group(), seriesname.text)
            file.write(newstring)
            print (newstring)
    
    file.close()

    Making a “Books Read” page

    So recently I came across a web page called How I manage my ebooks by a fellow named Aleksandar Todorović. He is a developer who wanted to track his reading on his webpage. He introduced me to a project called Calibre-Web, which is basically a web interface for Calibre with a few extra bells and whistles. Reading through his explanation it seemed pretty simple to implement except for this statement:

    As a final step in the chain, I have created a script that allows me to publish the list of books I’ve read on my website. Since Calibre-Web doesn’t have an API, I ended up scraping my own server using Python Requests and BeautifulSoup. After about one hundred lines of spaghetti code gets executed, I end up with two files:

    • books-read.md, which goes straight to my CMS, allowing me to publicly share the list of books I have read, sorted by the year in which I’ve finished reading them.

    The Process

    So I set about to try and implement my own version of Aleksandar’s project. In my typical trial and error fashion it took a couple of days of steady work and I learned a ton along the way.

    Calibre-Web

    I went ahead and downloaded Calibre-Web and wrestled with getting it running on my test server (my old mac-mini). It is a python script, and I am still a bit fuzzy about the proper way to actually implement it. I ended up writing a shell script to run the command "nohup python /Applications/calibre-web-master/cps.py" and then made it executable from my desktop. I still have some work to do there to finalize that solution.

    I have to say I really like the interface of Calibre-Web much more than the desktop Calibre and although there are a few quirks, I will likely be using the web version much more than the desktop from now on.

    Then I made a few shelves with the books I had read in 2019 and 2020 and was good to go. Now I just needed to get those Shelves onto my website somehow.

    Web Scraping

    Now I’d never heard of the term web scraping, but the concept was familiar and it turns out it is quite the thing.

    Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites
    Web scraping, Wikipedia

    The theory being that since all the info is available in the basic code of the Calibre-Web pages, all I needed to do was extract it, format it, then repost it to this site. So I did. Voila: My Books Read page.

    I guess I skipped the tough part…

    Starting out I understood Python was a programming language, but had no idea what Python Requests or BeautifulSoup were. Turns out that Python Requests is an HTTP library that fetches the raw HTML of a web page for you, and BeautifulSoup is a library (the terminology took me a while) for extracting and formatting those long strings of HTML into useful data.
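
    The smallest working example of the two together looks something like this (a sketch that fetches example.com rather than my Calibre-Web server):

    import requests
    from bs4 import BeautifulSoup

    # fetch a page and hand its html to BeautifulSoup
    res = requests.get('https://example.com')
    soup = BeautifulSoup(res.text, "html.parser")
    # pull a single piece of data back out of the html
    print(soup.find('h1').text)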

    Start with Google

    I started with a quick search and found a few likely examples to follow along with.

    https://medium.com/the-andela-way/introduction-to-web-scraping-87edf94ac692
    https://medium.com/the-andela-way/learn-how-to-scrape-the-web-2a7cc488e017
     https://www.dataquest.io/blog/web-scraping-beautifulsoup/

    These were helpful in explaining the structure and giving me some basic coding ideas, but I mostly relied on https://realpython.com/beautiful-soup-web-scraper-python/ to base my own code on.

    Step one

    I got everything running (this included sorting out the mess that is python on my computer, but that is another story) and tried to get a basic python script to talk to my Calibre installation. Turns out that even though my web browser was logged into Calibre-Web, my script wasn’t. So some more googling found me this video (Website login using request library in Python) and it did the trick for writing the login portion of my script.
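
    The pattern from that video boils down to a few lines (a sketch with placeholder url and credentials; the real values are in the full script earlier on this page):

    import requests

    login_data = {'username': 'username', 'password': 'password'}

    # the Session object keeps the login cookie for every later request
    with requests.Session() as sess:
        sess.post('http://urlpath/login', data=login_data)
        res = sess.get('http://urlpath')
        print(res.status_code)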

    Step two

    Then I wrote a basic script that extracted data (much more on this later) and saved it to a markdown file on the webserver. I figured markdown was easier to implement than html and knew WordPress could handle it.

    Or could it? Turns out the Jetpack implementation was choking on my markdown file for some reason. I fought with it for a while and eventually decided to see if I could find a different WordPress plugin to do the job. Turned out I could kill two birds with one stone using Mytory Markdown which would actually  load (and reload) a correctly formatted remote .md file to a page every time someone visited.

    Step three

    After I got a sample page loaded on the website I realized that it was missing the pub date and series name which, if you have ever visited one of my annual books read posts (Last Books of the decade: 2019, Books 2018—Is this the last year? etc.), is essential information. So I had to go into the Calibre-Web code and add those particular pieces of info to the shelf page so I would be able to scrape it all at the same time. I ended up adding this:

    {% if entry.series|length > 0 %}
        <p class="series">
            {{_('Book')}} {{entry.series_index}} {{_('of')}} <a href="{{url_for('web.books_list', data='series', sort='abc', book_id=entry.series[0].id)}}">{{entry.series[0].name}}</a>
        </p>
    {% endif %}
    
    
    {% if entry.pubdate[:10] != '0101-01-01' %}
        <p class="publishing-date">{{entry.pubdate|formatdate}} </p>
    {% endif %}

    …to shelf.html in the /templates folder of the Calibre-Web install. I added it around line 45 (just after the {% endif %} for the author section). It took a bit of fussing to look good but it worked out great.

    Step four

    Now all I have to do is figure out how to run my scrape.py script. For now I will leave it a manual process and just run it after I update my libre-Web shelves, but making that automatic is on the list for “What’s Next…”

    Ta-da

    So between this post and Aleksandar’s I hope you have a basic idea of what you need to do in order to try and implement this solution. More importantly, when future me comes back and tries to figure out what the hell all this gobbledey-gook mess is, I can rebuild the system based on these sketchy notes. I will end this here and continue in a new post on the actual python/beautifulsoup code I came up with to get the web scraping done.

    Instagram This Week

    The calm before the storm. Last spring in Rebecca Spit and the Octopus Islands. Can’t wait for spring 2020.
    #pnw #desolationsound #sailing #beautifulbc
    I’d always wondered.
    #healthyingredients
    Things to do, when it’s minus 32. It’s an internal heat ?
