Dark web-trained AI? Researchers may have developed new weapon against hackers



Large language models are all the rage these days, with new models appearing every other day. Most of these language giants, including OpenAI’s ChatGPT and Google’s Bard, are trained on text data from across the internet, including websites, articles, and books. That means their output reflects a mixed bag of sources, the brilliant alongside the dubious.

But what if an LLM were trained on the dark web instead of the regular web? Researchers did just that with DarkBERT, with surprising results. Let’s take a look.

What is DarkBERT?

A team of South Korean researchers has published a paper detailing how they built an LLM on a large dark web corpus collected by crawling the Tor network. The data included dozens of questionable sites in categories such as cryptocurrency, pornography, hacking, and weaponry. Due to ethical concerns, however, the team did not use the data as-is: to ensure the model was not trained on sensitive data, and to prevent malicious parties from extracting that information, the researchers filtered the pre-training corpus before feeding it to DarkBERT.

If you’re curious about the rationale behind the name DarkBERT, the LLM is based on the RoBERTa architecture, a transformer-based model developed by Facebook researchers in 2019.

Meta described RoBERTa as a “robustly optimized method for pre-training natural language processing (NLP) systems” that improves on BERT, which Google released in 2018. After Google open-sourced its LLM, Meta was able to improve on its performance.
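BERT-style pretraining, which RoBERTa refined, works by masking a fraction of the input tokens (typically around 15%) and training the model to predict them from the surrounding context. A minimal, illustrative sketch of just the masking step is below; this is a toy stand-in, not the researchers’ actual code, and the tokenizer and mask rate are simplified assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace a fraction of tokens with [MASK],
    returning the corrupted sequence and the prediction targets."""
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "stolen credentials are sold on hidden forums".split()
masked, targets = mask_tokens(tokens, rng=random.Random(1))
print(masked)   # with this seed, the first token is masked
print(targets)  # {0: 'stolen'}
```

During pretraining, the loss is computed only on the masked positions, which is what lets the model learn language statistics from unlabeled text such as a dark web crawl.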

The South Korean researchers then refined the original model by further pre-training it on dark web data for about 15 days, ultimately arriving at DarkBERT. The research paper notes that a machine with an Intel Xeon Gold 6348 CPU and four NVIDIA A100 80GB GPUs was used for this purpose.

What is the purpose of DarkBERT?

Despite its ominous-sounding name, DarkBERT is intended for security and law enforcement applications, not malicious schemes.

Because it was trained on the dark web, home to shady sites where huge datasets of stolen passwords frequently surface, DarkBERT is more effective in cybersecurity and cyber threat intelligence (CTI) applications than existing language models. The researchers behind the model have demonstrated its use in detecting ransomware leak sites.

Hackers and ransomware groups often upload leaked sensitive data, such as passwords and financial information, to the dark web for sale. The research paper suggests DarkBERT could help security researchers automatically identify such websites. It could also be used to crawl numerous dark web forums and monitor the exchange of illegal information.
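In practice, a model like DarkBERT would be fine-tuned as a classifier over crawled pages: given a page’s text, label it as a likely leak site or not. The paper’s actual pipeline isn’t public, so the general shape of such a detector is sketched here with a toy multinomial naive Bayes stand-in on invented sample text; a real system would use DarkBERT embeddings and real crawl data.

```python
from collections import Counter
import math

def train_nb(docs):
    """Fit a tiny multinomial naive Bayes; docs is a list of (text, label)."""
    word_counts = {0: Counter(), 1: Counter()}
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest (Laplace-smoothed) log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# invented toy training data: 1 = leak-site-like page, 0 = benign page
docs = [
    ("leaked database dump credentials for sale", 1),
    ("fresh combo list passwords dump", 1),
    ("company victims data published ransom deadline", 1),
    ("open source library release notes", 0),
    ("weather forecast sunny with light rain", 0),
    ("conference schedule keynote speakers", 0),
]
model = train_nb(docs)
print(predict(model, "new credentials dump for sale"))  # classed as leak-like (1)
```

The point of the sketch is the task framing, not the model: swapping the bag-of-words features for DarkBERT’s contextual representations is precisely what the researchers argue makes such detection more effective.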

However, while DarkBERT is better suited to dark-web domain-specific tasks than other models, the researchers concede that further fine-tuning may be key for some tasks, given the lack of publicly available, task-specific dark web data.

Is DarkBERT publicly available?

DarkBERT is not publicly available at this time. The researchers say they are considering releasing a preprocessed version of DarkBERT, that is, a version that has not been trained on sensitive data. However, they have not disclosed a timeline.

Either way, DarkBERT points to a future where AI models are tailored to specific tasks by training on highly specific data. Unlike ChatGPT and Google Bard, which resemble multi-purpose Swiss Army knives, DarkBERT is a specialized weapon designed to help thwart hackers.


