Analysing 65 million data points about estate agents

Written by Joshua Stopper - May 20, 2020

In late 2019, we sat down to begin discovery on a new project with Ben Johnson. Previously working with Ben on various projects at Wholegrain Digital, we knew we would be in for something interesting. Expanding on previous work we had completed, the objective this time was to analyse more than 65 million data points on property sales to understand market trends and compare estate agents. 

Working closely with Ben throughout the project lifecycle, we achieved an end result that communicates market trends across the United Kingdom nationwide and right down towns and cities. You can see the live Agent Guide website here.

The Agent Guide

The Agent Guide as it stands is not the first version of the idea, and as such, that presented us with the perfect opportunity to discuss the pain points, learnings and objectives for the rebuild. Beginning with understanding The Agent Guide and its market position, we discussed, amongst others, two technical pain points that were holding the site back.

The first pain point was the amount of manual work required to keep the site up to date with the latest statistics. Requiring the administrator to manually upload a sizable (over 2gb) CSV to the site, introduced a considerable amount of time to the process as it depended on the speed of their internet. With large uploads to the site not always being reliable, the process resulted in a lengthy, sometimes tedious process to keep the statistics up to date.

The second pain point was processing the raw data to generate information and interesting statistics. Processing large amounts of data into statistics can be a computationally expensive process, and when the data is not always clean, in the sense that it requires multisite steps of processing to be in a uniform manner, it can take quite some time to run queries against. Further complicated by the data being in a CSV format, it’s no small task. The first iteration of the site showed statistics such as the amount of properties available or the average listing time. For the new version, the client wanted to segment the data by additional metrics including: postcode (outcode and incode), property type (house, detached, semi-detached, etc) and bedrooms (1, 2, 3, etc). Ultimately this required a deep understanding of how the data was structured, and the possible quirks contained within.

Automating the data import

As any experienced developer will tell you, processing large amounts of data in a CSV is never ideal, it’s simply not the right tool for the job. Optimally, the data would be in a database where it can be queried directly with a language like SQL, a tool designed for the job. This was our first point of call. A large CSV is rarely the source of truth for a large dataset, most likely it was in a database and exported to CSV for review without granting direct access to the database. Working with the client, we helped them ask the right questions to their internal team, which resulted in exactly what we needed, a SQL file that we could import directly and analyse with the right tools.

This was only a third of the problem, we also needed to keep the local copy of the data up to date. Working again with the client, we agreed upon a schedule for which the latest export of the database would be made, resulting in a file hosted at a specific (and private) URL that could be imported at scheduled intervals.

The final third, was one of scale and performance. Most of the time, when a scheduled task is run from WordPress, it will be run in the same context as the website itself. What this can lead to, given the task takes long enough, and enough traffic to the site, is a degradation in performance, resulting in extended load times for end users. Further to that, most tasks that are run from WordPress have a limit on the time that they are allowed to run, which could result in a partial import of the data. Clearly, this wasn’t an option. The solution then was to run the whole import task directly on the server, outside of WordPress and the technology handling the end user requests. As the final piece to the puzzle, we wrote a bash script that would

  • Download the latest copy of the SQL file from the remote server
  • Remove the previously imported property data from the database
  • Import the new SQL file to the database
  • Remove the SQL file from the server as it consumes disk space
  • Clear the caching mechanisms in place that ensure a high level of performance
  • Notify the appropriate people of the scripts completion

This script was then scheduled at the server level to run once a month, outside of business hours, outside of WordPress, so that end users were not impacted by the process.

Analysing 65 million data points

Now that the data was available in a manner that allowed for efficient querying, we began working through the task of turning raw data into information and statistics. To begin with, we worked with the client to understand the statistics that were of interest, and what the parameters were to be for querying the data to generate the statistics.

The Agent Guide is dedicated to showing statistics about estate agents in a particular area. Immediately we had our first parameter for querying data – the area. In the database, an area is identified by postcodes, so that became our first means of segmenting the data. Once we had the property data for an area, we needed to analyse it to generate particular statistics. The database contains the name of the estate agency that has the property for sale, but it wasn’t in a structure that allowed us to run the queries against it that we needed. Fortunately the query by postcode reduced the amount of data that would need to be processed to generate the statistics by a large enough factor to perform the rest of the analysis in PHP.

Once we had the property data available to analyse with PHP, the first task we needed to complete was to segment the properties again, but this time by the estate agent that sold them. By looping over each of the properties, we generated a list of the agents operating in the area, and a list of the properties associated with each agent.

After that it was an exercise in basic mathematics to generate the statistics shown on the site. For the average listing price it was simple as adding the prices together and then dividing by how many properties there were. 

A screenshot of the above the fold area on the home page of
Housing market overview for Norwich showing the average asking price, the average time a property is listed, the average agent fee, and the number of properties currently available

Keeping it up to date / Automation

Once the system was capable of generating the statistics required, and outputting them to the screen, from a technical standpoint, it became an exercise in making it performant. Processing all 65 million data points in a performant manner is no mean feat, so we looked for methods of improving performance so the overhead of generating the statistics didn’t occur on every page load. 

To achieve a performant page load we introduced multiple caching layers, each targeting different parts of the process that ultimately delivers a page to a browser. Primarily we focussed on two areas, the generation of the statistics, and the generation of the html that is the page. Since the statistics were to be valid for a month, we first implemented a caching layer around the statistics using WordPress transients. The transients would store the calculated statistic in the database for one month so that they did not need to be generated on each page load. Further reducing the time to load the page, we implemented a front end cache that stores the HTML for the whole page, thus removing the need to generate the HTML everytime the page was requested.

Finally, tying this whole system together is the bash script we mentioned earlier. When the script finishes running to bring updated property information to the site for analysis (normally once a month), the WordPress transients and frontend cache are cleared, resulting in new statistics and pages being generated for display.

Not just a technical exercise

The Agent Guide was never just a technical exercise for a few statistics. At its core, there was always the need to take the statistics and information, and present it in a meaningful way that helped end users compare local estate agents and understand the market. Early on we identified what statistics we wanted to present and got to work on the best means of presenting it.

The Agent Guide is not just about finding the best agent in your area, but about finding the best for you. With this in mind, there’s no one key metric that defines the perfect agent, and ultimately this led to a few outcomes. The first being multiple ways to present the agents, and the second being multiple ways to sort the agents.

The list view of the agents in and around Norwich
The list view of the agents in and around Norwich

For presenting the data, we settled on three very different views: 

  1. a list view that presented each of the agents and the statistics associated with each, 
  2. a compare view that highlighted the best agent for each statistic, 
  3. and a map view that showed where the agent’s office is located. 

When sorting the agents on the list view, the user can sort by agent name, number of properties currently available, average asking price, percentage of properties that have had a price drop in the last 12 months, or average listing time, all in ascending or descending order.

The map view of the agents in and around Norwich
The map view of the agents in and around Norwich

Push the boundaries

WordPress is a very capable platform, and can mold to fit the needs of most projects. Working with the client to fully understand the pain points, we proposed and implemented a solution that automated the entire process of retrieving, importing, segmenting, and analysis of over 65 million data points on estate agents. 

The Agent Guide is just one example of a Wordpress solution we have delivered that combines beautiful, clear, user-centric design with powerful and considered technical development. Even if you don’t have 65 million data points we can probably help, so get in touch for a chat.