Meet DarkBERT, the ChatGPT-style AI that knows everything about the dark web

Researchers have developed an AI specialized in the dark web. Trained using data available on the dark side of the internet, DarkBERT is set to help authorities and cybersecurity experts better understand criminals.


In the wake of ChatGPT, Bard, Claude and Microsoft Bing's Prometheus, a plethora of chatbots have sprung up in recent months. Many companies are looking to ride the wave of artificial intelligence, either by developing their own language models or by building on existing ones.


Among this wave of AI models is DarkBERT. Developed by a team of South Korean researchers, it is designed to speed up research on the dark web. The project, documented in detail in a publicly accessible paper on arXiv, is "a valuable resource for future research", say the scientists behind it. Specifically, the model is based on Meta's RoBERTa architecture, which is itself derived from BERT (Bidirectional Encoder Representations from Transformers), one of the deep learning language models developed by Google.


Data exclusively from the dark web

To develop the model, the researchers fed it a corpus of data drawn exclusively from the dark web. Unlike models such as GPT-4 or PaLM 2, it was not trained on data from the clear web, the part of the web indexed by search engines.


According to the researchers' report, 5.83 GB of raw text from the dark side of the web was used to train DarkBERT. To gather this data, which is at the heart of how the model works, the researchers crawled dark web sites through Tor, the decentralized network that anonymizes connections and is required to reach these sites. The scientists collected millions of pieces of information, including texts written in dialects specific to certain criminal communities. For example, the algorithms "read" documents from black markets, including stolen databases and messages exchanged on forums.


Unsurprisingly, the designers had to filter the collected data "to address potential ethical concerns in texts related to sensitive information." The dataset was purged of content that endangers the privacy of Internet users, such as sensitive personal data. The dark web hosts many files containing stolen usernames or passwords, as well as information relating to fraud, scams or drug production. The experts were also confronted with an avalanche of illegal content, in particular child pornography. To prevent this material from feeding the model, the researchers limited themselves to collecting text, excluding images and videos:


“Our automated web crawler removes all non-text media and only stores raw text data. We make sure that we are not exposed to sensitive media that is potentially illegal.”
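
To make the idea of a text-only crawler concrete, here is a minimal sketch of how a page could be reduced to raw text before storage. It is not the researchers' crawler: the `extract_raw_text` function, the `TextOnlyExtractor` class and the list of skipped tags are assumptions chosen for illustration, and the example relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: an illustrative list of tags whose contents are not page text.
SKIPPED_TAGS = {"script", "style", "video", "audio", "object", "iframe"}


class TextOnlyExtractor(HTMLParser):
    """Collects text nodes and ignores media, scripts and styles."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # Attributes (src, alt, href...) are never stored, so image and
        # video URLs do not end up in the corpus.
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        cleaned = " ".join(data.split())  # collapse whitespace
        if self._skip_depth == 0 and cleaned:
            self._chunks.append(cleaned)

    def text(self):
        return " ".join(self._chunks)


def extract_raw_text(html):
    """Return only the text content of an HTML page."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><body><h1>Market</h1>"
        '<img src="a.png" alt="photo">'
        "<p>Item  for sale</p>"
        "<script>var x = 1;</script>"
        "</body></html>"
    )
    print(extract_raw_text(page))
```

In this sketch the image is dropped simply because only text nodes are ever stored; a real crawler would also have to avoid downloading the media files in the first place.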


Like most language models, DarkBERT relies heavily on English-language data, which makes up the majority of dark web content. Indeed, experts estimate that 90% of the texts available were written in English.


What is DarkBERT for?

As the Korea Advanced Institute of Science and Technology explains, “dark web-specific linguistic models can provide valuable insights,” as the studies performed “usually require textual analysis of the domain.” With this in mind, the model should help authorities, investigators and researchers better understand how the dark web works, a space heavily used by criminals of all kinds.


Above all, DarkBERT is meant to help computer security researchers. Thanks to the mountain of information gathered, the AI is able to detect “Dark web discussions, ransomware or leaks”. The model can flag a newly published stolen database or the appearance of a new strain of ransomware. Moreover, the researchers aim to gradually improve the AI so that it can regularly probe the dark web in search of new threats.
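
As a rough illustration of what flagging a page as a leak or a ransomware notice means in practice, the sketch below tags a text with topics. It is a deliberately simple keyword-matching stand-in, not DarkBERT's method, which relies on a trained language model. The `tag_topics` function, the keyword lists and the `min_hits` threshold are all hypothetical.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
TOPIC_KEYWORDS = {
    "ransomware": {"ransomware", "ransom", "decryptor", "encrypted"},
    "leak": {"leak", "leaked", "dump", "database", "credentials"},
}


def tag_topics(text, min_hits=2):
    """Return the sorted topics whose keywords appear at least min_hits times.

    Each distinct keyword counts once, however often it occurs in the text.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if len(words & keywords) >= min_hits
    )


if __name__ == "__main__":
    print(tag_topics("New database dump with leaked credentials"))
    print(tag_topics("Pay the ransom to get the decryptor"))
```

A keyword list like this misses slang and paraphrase, which is the kind of domain-specific language a model trained on dark web text is intended to handle better.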

