Introduction:
1. All Seattle jobs from Hacker News 'Who is hiring? (October 2018)' post

2. Hacker News Embeddings with PyTorch
All Seattle jobs from Hacker News 'Who is hiring? (October 2018)' post
Citrine | Redwood City, CA; Pittsburgh, PA | Backend, Frontend, Infrastructure, and Scientific Software Engineers | ONSITE or REMOTE | Full-time | https://citrine.io/ |
Citrine Informatics is building the next-generation materials development platform from the ground up, using the power of domain expertise, data, and machine learning to bring new materials to market faster, and capture materials-enabled product value.
As we scale, there are opportunities across our whole product/stack for people to join, learn, and contribute, not only to the success of the company but also to solving real-world, pertinent problems.
We are a team with diverse backgrounds and experiences, trained in materials science, data science, physics, biology, as well as computer science. The org is remote friendly, especially engineering, where a bunch of us are remote FT (Seattle, Salt Lake City, Ann Arbor, FL to name a few). You can work out of our offices in the Bay Area (Redwood City) or Pittsburgh, or remotely; the choice is yours.
Positions:
Sr SSE: https://citrine.io/careers/#scientific-software-engineer
Sr BE: https://citrine.io/careers/#senior-backend-software-engineer
Sr Infra: https://citrine.io/careers/#infrastructure-engineer
Sr Full Stack Engineer: https://citrine.io/careers/#senior-full-stack-software-engin...
Link to general job page: https://citrine.io/careers/
Some buzzwords to give an idea of what we are working with. If you have experience building and delivering quality products and take pride in enabling your users and team members, feel free to reach out even if you don't tick every box below.
Backend: Java/Scala, Ruby (RoR), Python
DS: Scala, Python
Frontend: Angular, React
DataStores: PostgreSQL, ElasticSearch
Cloud: AWS
Tooling: Jenkins, JUnit, Maven, SBT, etc.
Our customers include some of the world’s largest Fortune 1000 materials and product companies. Citrine is backed by leading investors including Tencent Holdings, B&C Holdings, Innovation Endeavors, DCVC (Data Collective), Prelude Ventures, AME Cloud, XSeed Capital, Morado Ventures, and Ulu Ventures.
Hacker News Embeddings with PyTorch
This post is based on Douwe Osinga's excellent Deep Learning Cookbook, specifically Chapter 4 on embeddings. Embedding is a simple idea: given an entity like a Hacker News post or a Hacker News user, we associate an n-dimensional vector with it. We then make a simple assertion: if two entities are similar in some way, their dot product (cosine similarity) should be 1, i.e. the vectors should be "aligned". If two entities are not similar, the dot product should be -1, i.e. the vectors should point in different directions. We then feed the data to a model, and in the training process get the optimizer to find assignments of entities to vectors such that those assertions are satisfied as well as possible. The most famous example of embeddings is Google's word2vec.
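To make the aligned/opposite intuition concrete, here is a tiny illustrative snippet (not part of the original post; the numbers are arbitrary):

    import torch
    import torch.nn.functional as F

    a = F.normalize(torch.tensor([1.0, 2.0, 3.0]), dim=0)   # a unit-length vector
    print(torch.dot(a, a).item())    #  1.0 -> aligned, "similar"
    print(torch.dot(a, -a).item())   # -1.0 -> opposite, "not similar"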
In the book, embedding is performed on movies. For each movie, its Wikipedia page is retrieved, and outgoing links to other wiki pages are collected. Two movies are similar if they both link to the same wiki page, else they are not similar. Keras is used to train the model and the results are reasonably good.
I wanted to implement the same thing in PyTorch, but on a different data set, to keep it interesting. As a regular Hacker News reader, I chose Hacker News. Users' likes are not public, but comments are, so I use commenting as the similarity signal.
The plan is:
Retrieve the top 1,000 HN posts from 2018 by number of comments
For each post, retrieve the unique set of users who commented
Use these pairs for similarity embedding
Train with mean squared error (MSE)
Use the resulting model to get:
post similarity: if I like post P, recommend other posts I might like
user recommendations: I am user U, recommend posts I might like
All the code shown here, with the data files, is up on GitHub.
The simplest way to get this is from Google BigQuery, which has a public Hacker News dataset. We can write a SQL query and download the results as a CSV file from the Google Cloud console:
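The original query isn't reproduced here; a sketch of the kind of query involved, assuming the public bigquery-public-data.hacker_news.full table and its type, title, descendants and timestamp columns (and run via the Python client rather than the console), might look like this:

    from google.cloud import bigquery   # pip install google-cloud-bigquery

    sql = """
    SELECT id, title, descendants AS num_comments
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'story' AND EXTRACT(YEAR FROM timestamp) = 2018
    ORDER BY descendants DESC
    LIMIT 1000
    """
    # Run the query and save the result in the same format as the console download.
    bigquery.Client().query(sql).to_dataframe().to_csv('top_1000_posts.csv', index=False)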
The result of this is top_1000_posts.csv.
Getting the comments is not practical from BigQuery because the table stores the tree hierarchy bottom-up (each comment stores the id of its parent comment, but not the ids of its children), so we'd have to query repeatedly to get all the comments of a post, which is inconvenient. Fortunately there's an easier way. Algolia has a Hacker News API where we can download one big JSON per post, containing all the comments. The API endpoint for this is:
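    https://hn.algolia.com/api/v1/items/<post_id>

(this is the Algolia HN Search API's item endpoint; substitute the numeric HN post id for <post_id>)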
So we just go through all the posts from the previous step and download each one from Algolia.
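A minimal sketch of that download step, with on-disk caching as described below (function and directory names are mine, not necessarily the original script's):

    import os
    import requests

    def fetch_post_json(post_id, cache_dir='cache'):
        # Download one post's full JSON (story plus all comments) from Algolia,
        # caching it on disk so a re-run doesn't hit the API again.
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, '%s.json' % post_id)
        if not os.path.exists(path):
            url = 'https://hn.algolia.com/api/v1/items/%s' % post_id
            with open(path, 'w') as f:
                f.write(requests.get(url).text)
        with open(path) as f:
            return f.read()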
Getting the set of commenters out of the JSON would be easiest with a JSON parser such as Python's json module, but this sometimes fails on bad JSON. Instead we use an rxe regexp:
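A plain-re version of the idea (the original uses the rxe regexp builder; the "author" field name comes from Algolia's JSON format):

    import re

    author_re = re.compile(r'"author"\s*:\s*"([^"]+)"')

    def get_commenters(raw_json):
        # Unique set of usernames appearing as comment authors in the raw JSON text.
        return sorted(set(author_re.findall(raw_json)))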
The entire code for this download script is on GitHub. The script caches files, so repeated runs don't re-download data from Algolia.
The script outputs the pairs into post_comments_1000.csv.
PyTorch has a built-in module for embeddings (torch.nn.Embedding), which makes building the model simple. It's essentially a big array which stores, for each entity, the assigned high-dimensional vector. In our case both posts and users are embedded, so if there are N_posts posts and N_users users, then N = N_posts + N_users. The array has N rows, and each row is that entity's embedding vector.
PyTorch will then optimize the entries in this array so that the dot products of the vector pairs come out as asserted during training (1 for similar pairs, -1 for dissimilar ones), or as close as possible.
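For example (n_posts and n_users stand in for the actual counts):

    import torch.nn as nn

    # One row per post and one per user; 50 dimensions, as used in training below.
    embedding = nn.Embedding(num_embeddings=n_posts + n_users, embedding_dim=50)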
The next step is to create a Model which contains the embedding. We implement the forward() function, which just returns the dot products for a minibatch of posts and users, as per the current embedding vectors:
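A minimal sketch of such a model (the class and argument names are mine, not necessarily the original's):

    import torch
    import torch.nn as nn

    class EmbeddingModel(nn.Module):
        def __init__(self, num_entities, embedding_dim):
            super().__init__()
            self.embedding = nn.Embedding(num_entities, embedding_dim)

        def forward(self, post_ids, user_ids):
            # Look up the current vectors and return one dot product
            # per (post, user) pair in the minibatch.
            post_vecs = self.embedding(post_ids)
            user_vecs = self.embedding(user_ids)
            return (post_vecs * user_vecs).sum(dim=1)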
Next, we need to write a function to build the minibatches we will use for training. For training, we will pass in existing (post, user) combinations and "assert" that the dot product should be 1, and some missing combinations with an asserted value of -1:
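A sketch of such a minibatch builder, assuming pairs is a list of (post_index, user_index) tuples and post_indexes/user_indexes are the index lists (all names are assumptions):

    import random
    import torch

    def make_minibatch(pairs, post_indexes, user_indexes, num_pos=500, num_neg=500):
        pair_set = set(pairs)
        # Positive examples: pairs where the user really commented on the post; target +1.
        pos = random.sample(pairs, num_pos)
        # Negative examples: random (post, user) pairs that never occurred; target -1.
        neg = []
        while len(neg) < num_neg:
            pair = (random.choice(post_indexes), random.choice(user_indexes))
            if pair not in pair_set:
                neg.append(pair)
        batch = pos + neg
        posts = torch.tensor([p for p, _ in batch])
        users = torch.tensor([u for _, u in batch])
        targets = torch.tensor([1.0] * len(pos) + [-1.0] * len(neg))
        return posts, users, targets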
Now we can perform the training. We will embed into 50 dimensions and use 500 positive and 500 negative combinations per minibatch. We use the Adam optimizer and minimize the mean squared error (MSE) between our asserted dot products and the actual dot products:
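A sketch of the training loop under those settings, reusing the model and minibatch sketches above (the learning rate and step count are guesses, not the original's values):

    model = EmbeddingModel(num_entities=n_posts + n_users, embedding_dim=50)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # lr is a guess
    loss_fn = torch.nn.MSELoss()

    for step in range(1000):                                     # step count is a guess
        posts, users, targets = make_minibatch(pairs, post_indexes, user_indexes, 500, 500)
        optimizer.zero_grad()
        loss = loss_fn(model(posts, users), targets)   # MSE between asserted and actual dot products
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(step, loss.item())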
Output:
We can see that training is able to reduce the MSE by about 40% from the initial random vectors by finding better alignments. That doesn’t sound too good, but it’s good enough for recommendations to work. Let’s write a function to find the closest vectors to a query vector:
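A sketch of such a nearest-neighbour lookup over the learned embedding matrix, using cosine similarity (the function and argument names are mine):

    import torch
    import torch.nn.functional as F

    def most_similar(query_index, embedding_matrix, top_n=5):
        # Cosine similarity between the query entity's vector and every embedded entity.
        vectors = F.normalize(embedding_matrix, dim=1)
        sims = vectors @ vectors[query_index]
        values, indices = torch.topk(sims, top_n + 1)   # +1 because the query matches itself
        return [(i, v) for v, i in zip(values.tolist(), indices.tolist())
                if i != query_index][:top_n]

called, for instance, as most_similar(post_index, model.embedding.weight.detach()).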
The entire IPython notebook is on GitHub. We can use this to find similar posts; it works reasonably well.
Query: Self-driving Uber car kills Arizona woman crossing street
0.89, Tempe Police Release Video of Uber Accident
0.69, Police Say Video Shows Woman Stepped Suddenly in Front of Self-Driving Uber
0.68, Tesla crash in September showed similarities to fatal Mountain View accident
Query: Ask HN: Who is hiring? (May 2018)
0.98, Ask HN: Who is hiring? (April 2018)
0.98, Ask HN: Who is hiring? (June 2018)
0.98, Ask HN: Who is hiring? (October 2018)
Query: Conversations with a six-year-old on functional programming
0.76, Common Lisp homepage
0.67, Towards Scala 3
0.66, JavaScript is Good, Actually
Query: You probably don't need AI/ML. You can make do with well written SQL scripts
0.66, Time to rebuild the web?
0.65, Oracle Wins Revival of Billion-Dollar Case Against Google
0.62, IBM is not doing "cognitive computing" with Watson (2016)
Query: Bitcoin has little shot at ever being a major global currency
0.71, U.S. Regulators to Subpoena Crypto Exchange Bitfinex, Tether
0.71, Buffett Says Stock Ownership Became More Attractive With Tax Cut
0.70, Building for the Blockchain
Query: 2018 MacBook Pro Review
0.75, Apple introduces macOS Mojave
0.75, Apple’s 2019 Mac Pro will be shaped by workflows
0.75, MacBook Pro with i9 chip is throttled due to thermal issues, claims YouTuber
