Patents I: Scraping against intellectual property

October 4th 2021

I have always had my doubts about the theory behind intellectual property, but I have never had any data to back them up. This series of posts is an attempt to gather some data while also exploring how certain websites work, how to gather information online, and possibly some machine learning. All of the code for these posts can be found on GitHub. Many of the posts will be short and simple, like this one; others will be longer and involve more exploration, depending on the level of bot prevention on the website.

While searching online for databases of IP lawsuits, I came across Justia, a website that keeps searchable records of legal cases, which can be filtered by type and state. Unfortunately, you have to pay for access to much of their data. However, you can still access the names of the parties, which is a good place to start.

Grabbing these is a simple matter of fetching the HTML page and using regular expressions to find the names.


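    import random
    import re
    
    import requests
    
    # Example values, assumed here for illustration
    state = 'california'
    page = 1
    
    # Random 10-letter user agent instead of the requests default
    headers = {
        'User-Agent': ''.join(random.sample('abcdefghijklmnopqrstuvwxyz', 10))
    }
    
    # nos-830 is the nature-of-suit code for patent cases
    url = f'https://dockets.justia.com/browse/state-{state}/noscat-10/nos-830?page={page}'
    
    response = requests.get(url, headers=headers)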

The Python requests module provides all of the necessary HTTP utilities. It is also often useful to randomize the User-Agent header: by default, requests identifies itself as python-requests, which is so common among scripts that even the most primitive detection systems can flag it as a bot.
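The header randomization can be wrapped in a small helper so that every request gets a fresh value. This is only a sketch: the names random_user_agent and build_headers are hypothetical, and it uses random.choices (letters may repeat) rather than the random.sample call above (letters never repeat).

```python
import random
import string


def random_user_agent(length=10):
    """Return a random string of lowercase letters for the User-Agent header."""
    # random.choices draws with replacement, so letters may repeat
    return ''.join(random.choices(string.ascii_lowercase, k=length))


def build_headers(length=10):
    """Build a headers dict with a freshly randomized User-Agent."""
    return {'User-Agent': random_user_agent(length)}


headers = build_headers()
print(headers)
```

The resulting dict can be passed straight to requests.get(url, headers=headers).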

I also added an option to filter by the state in which the case is filed. The filing state is often the same as the state in which the defendant is registered. This reduces processing time because we need to check fewer state corporation databases later on to find information about one of the two parties.
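As a rough sketch of how the state filter and page number fit into the URL, the helpers below build the addresses without sending any requests. The function names, the 'california' value, and the assumption that pages are numbered from 1 are illustrative guesses, not something confirmed by Justia.

```python
BASE_URL = 'https://dockets.justia.com/browse'


def build_url(state, page=1):
    """Build the listing URL for one state and one results page."""
    return f'{BASE_URL}/state-{state}/noscat-10/nos-830?page={page}'


def page_urls(state, last_page):
    """Yield listing URLs for pages 1..last_page (1-based numbering is assumed)."""
    for page in range(1, last_page + 1):
        yield build_url(state, page)


for url in page_urls('california', 3):
    print(url)
```

Keeping the URL construction separate from the request makes it easy to loop over several states or pages later.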

Finally, a regular expression grabs all of the characters between the strong tags inside each element with the case-name class. Each case name is then split on ' v. ' to separate the two parties.


    html = response.content.decode('utf-8')
    cases = [case.split(' v. ') for case in re.findall(r'class="case-name"><strong>(.+?)</strong></a>', html)]
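To see what the expression does without hitting the website, it can be run against a small HTML snippet. The markup and party names below are made up for illustration and only mimic the structure the pattern expects; parse_cases is a hypothetical helper name.

```python
import re

# Same pattern as above, written as a raw string and compiled once
CASE_PATTERN = re.compile(r'class="case-name"><strong>(.+?)</strong></a>')


def parse_cases(html):
    """Return a list of [plaintiff, defendant] pairs found in the HTML."""
    return [name.split(' v. ') for name in CASE_PATTERN.findall(html)]


# Invented sample markup, not copied from the real page
sample = (
    '<a href="/cases/1" class="case-name"><strong>Alpha Inc. v. Beta LLC</strong></a>'
    '<a href="/cases/2" class="case-name"><strong>Gamma Corp. v. Delta Co.</strong></a>'
)

print(parse_cases(sample))
# [['Alpha Inc.', 'Beta LLC'], ['Gamma Corp.', 'Delta Co.']]
```

The non-greedy (.+?) matters here: a greedy match would run from the first opening strong tag to the last closing one and merge both cases into a single string.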

The next post will cover finding data about these corporations on the New York and California Department of State websites.