Hackmap

https://tomthe.github.io/hackmap

This website shows a few million submissions, comments and users of Hacker News. The position of each node is determined by an algorithm that places similar content close together. The size of a node reflects its score or the number of replies it received.

How to use:

  • Use the search box to find a user or a submission
  • Use the mouse wheel or two fingers to zoom in and out
  • Click on a node to get a link to its source

The data was downloaded from the HN API. Since ~40 million items are too many for a browser, I kept only the 2.6 million with at least some replies. I fed the comments to a SentenceTransformers model (all-MiniLM-L6-v2) to create text embeddings. Titles of submissions are often ambiguous, so I used the average embedding of each submission's comments to get a better representation of its content. The same was done for the users. Then I used UMAP to reduce the dimensionality of the embeddings to 3, 2 and 1 dimensions: 3 for the colors, 2 for the placement of the nodes and 1 for a plot with the time dimension. The 3D colors didn't add much information, so I removed them.
I also used BERTopic to get clusters and names for these clusters, but they don't add much information beyond the titles of the submissions either.
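The averaging step in the pipeline above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the site: it assumes the comment embeddings have already been computed, and the names `mean_embeddings`, `comment_vectors` and `comment_owner` are hypothetical. The "owner" of a comment would be its submission (or its author, for the user map).

```python
from collections import defaultdict


def mean_embeddings(comment_vectors, comment_owner):
    """Average precomputed comment embeddings per owner.

    comment_vectors: dict mapping comment id -> list of floats (hypothetical input)
    comment_owner:   dict mapping comment id -> submission id or user name
    Returns a dict mapping owner -> element-wise mean vector.
    """
    sums = {}
    counts = defaultdict(int)
    for comment_id, vector in comment_vectors.items():
        owner = comment_owner[comment_id]
        if owner not in sums:
            sums[owner] = [0.0] * len(vector)
        accumulator = sums[owner]
        for i, value in enumerate(vector):
            accumulator[i] += value
        counts[owner] += 1
    # Divide each accumulated vector by the number of comments it contains.
    return {owner: [v / counts[owner] for v in acc] for owner, acc in sums.items()}
```

The resulting vectors would then be passed on to the dimensionality reduction in place of the title embeddings.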
There are several implementations of maps like this. (todo: add links) Some of them are very sophisticated, but they don't show the actual text on the canvas. I think showing as much information as possible without overwhelming the user (and the browser...) is very important for how much the user can get out of such a visualization of big data. Another important aspect is that I wanted to host the whole thing on a static host, which makes things much easier in the long term. I used mostly vanilla JavaScript (a good decision for such a site: no build step and no fighting against Svelte or React) and the excellent force-graph library.

Since there are too many data points to show at once, the page fetches a base map with the 40 000 most important nodes and then fetches additional data tiles when you zoom in. Unfortunately, I couldn't find the time to implement a static search over all the data, so the search currently only covers the base tile of 40 000 nodes.
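One way to prepare such a base map plus tiles for a static host is sketched below. This is an assumption-laden illustration, not the actual preprocessing code: it assumes a single tile level on a square grid, and the names `tile_key`, `split_into_tiles`, `importance`, `grid`, `min_xy` and `max_xy` are all hypothetical.

```python
from collections import defaultdict


def tile_key(x, y, min_xy, max_xy, grid):
    """Map a 2D position to (column, row) in a grid x grid tiling of the layout."""
    span = max_xy - min_xy
    # Clamp so that points exactly on the upper edge fall into the last tile.
    col = min(int((x - min_xy) / span * grid), grid - 1)
    row = min(int((y - min_xy) / span * grid), grid - 1)
    return col, row


def split_into_tiles(nodes, base_size, grid, min_xy, max_xy):
    """Split nodes into a base map and per-tile lists.

    nodes: list of dicts with "x", "y" and "importance" keys (hypothetical schema).
    The base_size most important nodes form the base map; the rest are
    grouped by the tile their position falls into.
    """
    ranked = sorted(nodes, key=lambda n: n["importance"], reverse=True)
    base = ranked[:base_size]
    tiles = defaultdict(list)
    for node in ranked[base_size:]:
        key = tile_key(node["x"], node["y"], min_xy, max_xy, grid)
        tiles[key].append(node)
    return base, dict(tiles)
```

Each tile could then be written to its own static file, so the browser only needs to request the tiles that overlap the current viewport.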
The color of the nodes is based on the publication date. The size of a submission is based on its score; the size of a comment or user is based on the number of direct and indirect child comments.
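Counting direct and indirect child comments amounts to counting descendants in the reply tree. A minimal sketch, assuming each item stores the id of its parent (the function name `descendant_counts` and the `parent_of` mapping are hypothetical, not taken from the project):

```python
def descendant_counts(parent_of):
    """Count direct and indirect children for every item.

    parent_of: dict mapping item id -> parent id, or None for top-level items.
    Returns a dict mapping item id -> number of descendants.
    """
    counts = {item: 0 for item in parent_of}
    for item in parent_of:
        # Walk up the reply chain and credit every ancestor with one descendant.
        ancestor = parent_of[item]
        while ancestor is not None:
            counts[ancestor] = counts.get(ancestor, 0) + 1
            # Ancestors missing from the mapping end the walk.
            ancestor = parent_of.get(ancestor)
    return counts
```

For a user, the same counts could be summed over all of that user's comments.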


The biggest challenge in this project was that it worked so well that I was constantly distracted by the stories and comments I discovered while testing the plot. This is why I am releasing it now in this work-in-progress state. Firefox doesn't render some nodes when zoomed in too far. Chrome renders them, but has problems showing the correct tooltips.

Candos and Todos:

  • Better search
  • More levels of tiles
  • Tuning of the size and show parameters
  • Earlier data from HN
  • Better UI
  • Other datasets (MusicBrainz, OpenAlex, Newspapers,...)

If you have any questions or suggestions, please get in touch at [email protected]

The code can be found on GitHub: github.com/tomthe/demographymap