Evan Pratten
Software Developer

Bill Wurtz is an American musician who became reasonably famous through short musical videos posted to Vine and YouTube. I was browsing his website the other day, stumbled upon a page labeled notebook, and thought I should check it out.

Bill's notebook is a large collection (about 580 posts) of random thoughts, ideas, and sometimes just strings of words. A prime source of entertainment, and of neural network inputs.

"If you are looking to burn something, fire may be just the ticket" - Bill Wurtz

Choosing the right tool for the job

If you haven't noticed yet, I'm building a neural net to generate notes based on his writing style and content. Anyone who has read my first post will know that I have already done a similar project in the past. This means it's time to reuse some code!

For this project, I decided to use an amazing library by @minimaxir called textgenrnn. This Python library will handle all of the heavy (and light) work of training an RNN on a text dataset, then generating new text.

Building a dataset

This project was a joke, so I didn't bother with properly grabbing, categorizing, and parsing each post. Instead, I built a little script to pull every HTML file from Bill's website and regex out the body. This ended up leaving some artifacts in the output, but I don't really mind.

import re
import requests


def loadAllUrls():
    # Fetch the notebook index page and pull out every post link
    page = requests.get("https://billwurtz.com/notebook.html").text

    # Each post is linked via an uppercase HREF attribute on the index page
    links = re.findall(r"HREF=\"(.*)\"style", page)

    return links


def dumpEach(urls):
    for url in urls:
        # Fetch the post and strip line breaks so each note sits on one line
        page = requests.get(f"https://billwurtz.com/{url}").text.strip().replace(
            "</br>", "").replace("<br>", "").replace("\n", " ")

        # Everything after </head> is (roughly) the body of the note
        data = re.findall(r"</head>(.*)", page, re.MULTILINE)

        # Skip pages that didn't match
        if len(data) == 0:
            continue

        print(data[0])


urls = loadAllUrls()
print(f"Loaded {len(urls)} pages")
dumpEach(urls)

This script will print each of Bill's notes to the console (each on its own line). I used a simple shell redirect to write the output to a file.

python3 scrape.py > posts.txt

Training

To train the RNN, I just used some of textgenrnn's example code to read the posts file and save the trained weights to an HDF5 file.

from textgenrnn import textgenrnn

# Train a fresh model on the scraped notes; the weights are written to an
# HDF5 file as training progresses
generator = textgenrnn()
generator.train_from_file("/path/to/posts.txt", num_epochs=100)

This takes quite a while to run, so I offloaded it to a DigitalOcean Droplet and left it running overnight.
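Once training finishes, generating new notes from the saved weights only takes a few more lines. Here is a minimal sketch, assuming textgenrnn's default weights filename (adjust the path if yours differs):

from textgenrnn import textgenrnn

# Load the weights saved during training (textgenrnn's default filename)
generator = textgenrnn("textgenrnn_weights.hdf5")

# Generate a handful of notes; lower temperatures stay closer to Bill's
# phrasing, higher ones get weirder
notes = generator.generate(n=5, temperature=0.5, return_as_list=True)
for note in notes:
    print(note)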

The results

Here are some of my favorite generated notes:

"note: do not feel better"

"hi I am a car."

"i am stuff and think about this before . this is it, the pond. how do they make me feel better?"

"i am still about the floor"

Not perfect, but it is readable English, so I call it a win!

Play with the code

I have uploaded the basic code, the scraped posts, and a partial HDF5 file to GitHub for anyone to play with. Maybe make a Twitter bot out of this?
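If you do end up building a Twitter bot, a minimal sketch with tweepy might look something like this (the credentials are placeholders from a Twitter developer account, and the weights path assumes textgenrnn's default filename):

import tweepy
from textgenrnn import textgenrnn

# Placeholder API credentials
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Load the trained weights and tweet a single generated note
generator = textgenrnn("textgenrnn_weights.hdf5")
note = generator.generate(return_as_list=True)[0]
api.update_status(note)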