Tag • #llm

chevron_right

How the “Frontier” Became the Slogan of Uncontrolled AI

news.movim.eu / Schneier · Thursday, 29 February - 04:27 · 12 minutes

Artificial intelligence (AI) has been billed as the next frontier of humanity: the newly available expanse whose exploration will drive the next era of growth, wealth, and human flourishing. It’s a scary metaphor. Throughout American history, the drive for expansion and the very concept of terrain up for grabs—land grabs, gold rushes, new frontiers—have provided a permission structure for imperialism and exploitation. This could easily hold true for AI.

This isn’t the first time the concept of a frontier has been used as a metaphor for AI, or technology in general. As early as 2018, the powerful foundation models powering cutting-edge applications like chatbots have been called “frontier AI.” In previous decades, the internet itself was considered an electronic frontier. Early cyberspace pioneer John Perry Barlow wrote “Unlike previous frontiers, this one has no end.” When he and others founded the internet’s most important civil liberties organization, they called it the Electronic Frontier Foundation .

America’s experience with frontiers is fraught, to say the least. Expansion into the Western frontier and beyond has been a driving force in our country’s history and identity—and has led to some of the darkest chapters of our past. The tireless drive to conquer the frontier has directly motivated some of this nation’s most extreme episodes of racism, imperialism, violence, and exploitation.

That history has something to teach us about the material consequences we can expect from the promotion of AI today. The race to build the next great AI app is not the same as the California gold rush . But the potential that outsize profits will warp our priorities, values, and morals is, unfortunately, analogous.

Already, AI is starting to look like a colonialist enterprise. AI tools are helping the world’s largest tech companies grow their power and wealth, are spurring nationalistic competition between empires racing to capture new markets, and threaten to supercharge government surveillance and systems of apartheid. It looks more than a bit like the competition among colonialist state and corporate powers in the seventeenth century, which together carved up the globe and its peoples. By considering America’s past experience with frontiers, we can understand what AI may hold for our future, and how to avoid the worst potential outcomes.

America’s “Frontier” Problem

For 130 years, historians have used frontier expansion to explain sweeping movements in American history. Yet only for the past thirty years have we generally acknowledged its disastrous consequences.

Frederick Jackson Turner famously introduced the frontier as a central concept for understanding American history in his vastly influential 1893 essay . As he concisely wrote, “American history has been in a large degree the history of the colonization of the Great West.”

Turner used the frontier to understand all the essential facts of American life: our culture, way of government, national spirit, our position among world powers, even the “struggle” of slavery. The endless opportunity for westward expansion was a beckoning call that shaped the American way of life. Per Turner’s essay, the frontier resulted in the individualistic self-sufficiency of the settler and gave every (white) man the opportunity to attain economic and political standing through hardscrabble pioneering across dangerous terrain.The New Western History movement, gaining steam through the 1980s and led by researchers like Patricia Nelson Limerick, laid plain the racial, gender, and class dynamics that were always inherent to the frontier narrative. This movement’s story is one where frontier expansion was a tool used by the white settler to perpetuate a power advantage.The frontier was not a siren calling out to unwary settlers; it was a justification, used by one group to subjugate another. It was always a convenient, seemingly polite excuse for the powerful to take what they wanted. Turner grappled with some of the negative consequences and contradictions of the frontier ethic and how it shaped American democracy. But many of those whom he influenced did not do this; they celebrated it as a feature, not a bug. Theodore Roosevelt wrote extensively and explicitly about how the frontier and his conception of white supremacy justified expansion to points west and, through the prosecution of the Spanish-American War, far across the Pacific. Woodrow Wilson, too, celebrated the imperial loot from that conflict in 1902. Capitalist systems are “addicted to geographical expansion” and even, when they run out of geography, seek to produce new kinds of spaces to expand into. This is what the geographer David Harvey calls the “ spatial fix .”Claiming that AI will be a transformative expanse on par with the Louisiana Purchase or the Pacific frontiers is a bold assertion—but increasingly plausible after a year dominated by ever more impressive demonstrations of generative AI tools. It’s a claim bolstered by billions of dollars in corporate investment, by intense interest of regulators and legislators worldwide in steering how AI is developed and used, and by the variously utopian or apocalyptic prognostications from thought leaders of all sectors trying to understand how AI will shape their sphere—and the entire world.

AI as a Permission Structure

Like the western frontier in the nineteenth century, the maniacal drive to unlock progress via advancement in AI can become a justification for political and economic expansionism and an excuse for racial oppression.

In the modern day, OpenAI famously paid dozens of Kenyans little more than a dollar an hour to process data used in training their models underlying products such as ChatGPT. Paying low wages to data labelers surely can’t be equated to the chattel slavery of nineteenth-century America. But these workers did endure brutal conditions, including being set to constantly review content with “graphic scenes of violence, self-harm, murder, rape, necrophilia, child abuse, bestiality, and incest.” There is a global market for this kind of work, which has been essential to the most important recent advances in AI such as Reinforcement Learning with Human Feedback , heralded as the most important breakthrough of ChatGPT.

The gold rush mentality associated with expansion is taken by the new frontiersmen as permission to break the rules, and to build wealth at the expense of everyone else. In 1840s California, gold miners trespassed on public lands and yet were allowed to stake private claims to the minerals they found, and even to exploit the water rights on those lands. Again today, the game is to push the boundaries on what rule-breaking society will accept, and hope that the legal system can’t keep up.

Many internet companies have behaved in exactly the same way since the dot-com boom. The prospectors of internet wealth lobbied for, or simply took of their own volition, numerous government benefits in their scramble to capture those frontier markets. For years, the Federal Trade Commission has looked the other way or been lackadaisical in halting antitrust abuses by Amazon , Facebook , and Google . Companies like Uber and Airbnb exploited loopholes in, or ignored outright, local laws on taxis and hotels. And Big Tech platforms enjoyed a liability shield that protected them from punishment the contents people posted to their sites.

We can already see this kind of boundary pushing happening with AI.

Modern frontier AI models are trained using data, often copyrighted materials, with untested legal justification. Data is like water for AI, and, like the fight over water rights in the West, we are repeating a familiar process of public acquiescence to private use of resources. While some lawsuits are pending , so far AI companies have faced no significant penalties for the unauthorized use of this data.

Pioneers of self-driving vehicles tried to skip permitting processes and used fake demonstrations of their capabilities to avoid government regulation and entice consumers. Meanwhile, AI companies’ hope is that they won’t be held to blame if the AI tools they produce spew out harmful content that causes damage in the real world. They are trying to use the same liability shield that fostered Big Tech’s exploitation of the previous electronic frontiers—the web and social media—to protect their own actions.

Even where we have concrete rules governing deleterious behavior, some hope that using AI is itself enough to skirt them. Copyright infringement is illegal if a person does it, but would that same person be punished if they train a large language model to regurgitate copyrighted works? In the political sphere, the Federal Election Commission has precious few powers to police political advertising; some wonder if they simply won’t be considered relevant if people break those rules using AI.

AI and American Exceptionalism

Like The United States’ historical frontier, AI has a feel of American exceptionalism. Historically, we believed we were different from the Old World powers of Europe because we enjoyed the manifest destiny of unrestrained expansion between the oceans. Today, we have the most CPU power , the most data scientists , the most venture-capitalist investment , and the most AI companies . This exceptionalism has historically led many Americans to believe they don’t have to play by the same rules as everyone else.

Both historically and in the modern day, this idea has led to deleterious consequences such as militaristic nationalism (leading to justifying of foreign interventions in Iraq and elsewhere), masking of severe inequity within our borders, abdication of responsibility from global treaties on climate and law enforcement , and alienation from the international community. American exceptionalism has also wrought havoc on our country’s engagement with the internet, including lawless spying and surveillance by forces like the National Security Agency.

The same line of thinking could have disastrous consequences if applied to AI. It could perpetuate a nationalistic, Cold War–style narrative about America’s inexorable struggle with China, this time predicated on an AI arms race. Moral exceptionalism justifies why we should be allowed to use tools and weapons that are dangerous in the hands of a competitor, or enemy. It could enable the next stage of growth of the military-industrial complex, with claims of an urgent need to modernize missile systems and drones through using AI. And it could renew a rationalization for violating civil liberties in the US and human rights abroad, empowered by the idea that racial profiling is more objective if enforced by computers.The inaction of Congress on AI regulation threatens to land the US in a regime of de facto American exceptionalism for AI. While the EU is about to pass its comprehensive AI Act , lobbyists in the US have muddled legislative action. While the Biden administration has used its executive authority and federal purchasing power to exert some limited control over AI, the gap left by lack of legislation leaves AI in the US looking like the Wild West —a largely unregulated frontier.The lack of restraint by the US on potentially dangerous AI technologies has a global impact. First, its tech giants let loose their products upon the global public, with the harms that this brings with it. Second, it creates a negative incentive for other jurisdictions to more forcefully regulate AI. The EU’s regulation of high-risk AI use cases begins to look like unilateral disarmament if the US does not take action itself. Why would Europe tie the hands of its tech competitors if the US refuses to do the same?

AI and Unbridled Growth

The fundamental problem with frontiers is that they seem to promise cost-free growth. There was a constant pressure for American westward expansion because a bigger, more populous country accrues more power and wealth to the elites and because, for any individual, a better life was always one more wagon ride away into “empty” terrain. AI presents the same opportunities. No matter what field you’re in or what problem you’re facing, the attractive opportunity of AI as a free labor multiplier probably seems like the solution; or, at least, makes for a good sales pitch.

That would actually be okay, except that the growth isn’t free. America’s imperial expansion displaced, harmed, and subjugated native peoples in the Americas, Africa, and the Pacific, while enlisting poor whites to participate in the scheme against their class interests. Capitalism makes growth look like the solution to all problems, even when it’s clearly not. The problem is that so many costs are externalized. Why pay a living wage to human supervisors training AI models when an outsourced gig worker will do it at a fraction of the cost? Why power data centers with renewable energy when it’s cheaper to surge energy production with fossil fuels ? And why fund social protections for wage earners displaced by automation if you don’t have to? The potential of consumer applications of AI, from personal digital assistants to self-driving cars, is irresistible; who wouldn’t want a machine to take on the most routinized and aggravating tasks in your daily life? But the externalized cost for consumers is accepting the inevitability of domination by an elite who will extract every possible profit from AI services.

Controlling Our Frontier Impulses

None of these harms are inevitable. Although the structural incentives of capitalism and its growth remain the same, we can make different choices about how to confront them.

We can strengthen basic democratic protections and market regulations to avoid the worst impacts of AI colonialism. We can require ethical employment for the humans toiling to label data and train AI models. And we can set the bar higher for mitigating bias in training and harm from outputs of AI models.

We don’t have to cede all the power and decision making about AI to private actors. We can create an AI public option to provide an alternative to corporate AI. We can provide universal access to ethically built and democratically governed foundational AI models that any individual—or company—could use and build upon.

More ambitiously, we can choose not to privatize the economic gains of AI. We can cap corporate profits, raise the minimum wage, or redistribute an automation dividend as a universal basic income to let everyone share in the benefits of the AI revolution. And, if these technologies save as much labor as companies say they do, maybe we can also all have some of that time back.

And we don’t have to treat the global AI gold rush as a zero-sum game. We can emphasize international cooperation instead of competition. We can align on shared values with international partners and create a global floor for responsible regulation of AI. And we can ensure that access to AI uplifts developing economies instead of further marginalizing them.

This essay was written with Nathan Sanders, and was originally published in Jacobin .

chevron_right

Google upstages itself with Gemini 1.5 AI launch, one week after Ultra 1.0

news.movim.eu / ArsTechnica · Thursday, 15 February - 20:45 · 1 minute

Enlarge / The Gemini 1.5 logo, released by Google. (credit: Google)

One week after its last major AI announcement, Google appears to have upstaged itself. Last Thursday, Google launched Gemini Ultra 1.0 , which supposedly represented the best AI language model Google could muster—available as part of the renamed "Gemini" AI assistant (formerly Bard). Today, Google announced Gemini Pro 1.5, which it says "achieves comparable quality to 1.0 Ultra, while using less compute."

Congratulations, Google, you've done it. You've undercut your own premiere AI product. While Ultra 1.0 is possibly still better than Pro 1.5 (what even are we saying here), Ultra was presented as a key selling point of its "Gemini Advanced" tier of its Google One subscription service. And now it's looking a lot less advanced than seven days ago. All this is on top of the confusing name-shuffling Google has been doing recently. (Just to be clear—although it's not really clarifying at all—the free version of Bard/Gemini currently uses the Pro 1.0 model. Got it?)

Google claims that Gemini 1.5 represents a new generation of LLMs that "delivers a breakthrough in long-context understanding," and that it can process up to 1 million tokens, "achieving the longest context window of any large-scale foundation model yet." Tokens are fragments of a word. The first part of the claim about "understanding" is contentious and subjective, but the second part is probably correct. OpenAI's GPT-4 Turbo can reportedly handle 128,000 tokens in some circumstances, and 1 million is quite a bit more—about 700,000 words. A larger context window allows for processing longer documents and having longer conversations. (The Gemini 1.0 model family handles 32,000 tokens max.)

Read 6 remaining paragraphs | Comments

chevron_right

Chatbots and Human Conversation

news.movim.eu / Schneier · Friday, 26 January - 14:18 · 6 minutes

For most of history, communicating with a computer has not been like communicating with a person. In their earliest years, computers required carefully constructed instructions, delivered through punch cards; then came a command-line interface, followed by menus and options and text boxes. If you wanted results, you needed to learn the computer’s language.

This is beginning to change. Large language models—the technology undergirding modern chatbots—allow users to interact with computers through natural conversation, an innovation that introduces some baggage from human-to-human exchanges. Early on in our respective explorations of ChatGPT, the two of us found ourselves typing a word that we’d never said to a computer before: “Please.” The syntax of civility has crept into nearly every aspect of our encounters; we speak to this algebraic assemblage as if it were a person—even when we know that it’s not .

Right now, this sort of interaction is a novelty. But as chatbots become a ubiquitous element of modern life and permeate many of our human-computer interactions, they have the potential to subtly reshape how we think about both computers and our fellow human beings.

One direction that these chatbots may lead us in is toward a society where we ascribe humanity to AI systems, whether abstract chatbots or more physical robots. Just as we are biologically primed to see faces in objects, we imagine intelligence in anything that can hold a conversation. (This isn’t new: People projected intelligence and empathy onto the very primitive 1960s chatbot, Eliza .) We say “please” to LLMs because it feels wrong not to.

Chatbots are growing only more common, and there is reason to believe they will become ever more intimate parts of our lives . The market for AI companions, ranging from friends to romantic partners, is already crowded. Several companies are working on AI assistants, akin to secretaries or butlers, that will anticipate and satisfy our needs . And other companies are working on AI therapists, mediators, and life coaches—even simulacra of our dead relatives. More generally, chatbots will likely become the interface through which we interact with all sorts of computerized processes—an AI that responds to our style of language, every nuance of emotion, even tone of voice.

Many users will be primed to think of these AIs as friends, rather than the corporate-created systems that they are. The internet already spies on us through systems such as Meta’s advertising network, and LLMs will likely join in: OpenAI’s privacy policy , for example, already outlines the many different types of personal information the company collects. The difference is that the chatbots’ natural-language interface will make them feel more humanlike—reinforced with every politeness on both sides—and we could easily miscategorize them in our minds.

Major chatbots do not yet alter how they communicate with users to satisfy their parent company’s business interests, but market pressure might push things in that direction. Reached for comment about this, a spokesperson for OpenAI pointed to a section of the privacy policy noting that the company does not currently sell or share personal information for “cross-contextual behavioral advertising,” and that the company does not “process sensitive Personal Information for the purposes of inferring characteristics about a consumer.” In an interview with Axios earlier today, OpenAI CEO Sam Altman said future generations of AI may involve “quite a lot of individual customization,” and “that’s going to make a lot of people uncomfortable.”

Other computing technologies have been shown to shape our cognition. Studies indicate that autocomplete on websites and in word processors can dramatically reorganize our writing. Generally, these recommendations result in blander, more predictable prose. And where autocomplete systems give biased prompts, they result in biased writing. In one benign experiment , positive autocomplete suggestions led to more positive restaurant reviews, and negative autocomplete suggestions led to the reverse. The effects could go far beyond tweaking our writing styles to affecting our mental health, just as with the potentially depression- and anxiety-inducing social-media platforms of today .

The other direction these chatbots may take us is even more disturbing: into a world where our conversations with them result in our treating our fellow human beings with the apathy, disrespect, and incivility we more typically show machines.

Today’s chatbots perform best when instructed with a level of precision that would be appallingly rude in human conversation, stripped of any conversational pleasantries that the model could misinterpret: “Draft a 250-word paragraph in my typical writing style, detailing three examples to support the following point and cite your sources.” Not even the most detached corporate CEO would likely talk this way to their assistant, but it’s common with chatbots.

If chatbots truly become the dominant daily conversation partner for some people, there is an acute risk that these users will adopt a lexicon of AI commands even when talking to other humans. Rather than speaking with empathy, subtlety, and nuance, we’ll be trained to speak with the cold precision of a programmer talking to a computer. The colorful aphorisms and anecdotes that give conversations their inherently human quality, but that often confound large language models, could begin to vanish from the human discourse.

For precedent, one need only look at the ways that bot accounts already degrade digital discourse on social media, inflaming passions with crudely programmed responses to deeply emotional topics; they arguably played a role in sowing discord and polarizing voters in the 2016 election. But AI companions are likely to be a far larger part of some users’ social circle than the bots of today, potentially having a much larger impact on how those people use language and navigate relationships. What is unclear is whether this will negatively affect one user in a billion or a large portion of them.

Such a shift is unlikely to transform human conversations into cartoonishly robotic recitations overnight, but it could subtly and meaningfully reshape colloquial conversation over the course of years, just as the character limits of text messages affected so much of colloquial writing, turning terms such as LOL , IMO , and TMI into everyday vernacular.

AI chatbots are always there when you need them to be, for whatever you need them for. People aren’t like that. Imagine a future filled with people who have spent years conversing with their AI friends or romantic partners. Like a person whose only sexual experiences have been mediated by pornography or erotica, they could have unrealistic expectations of human partners. And the more ubiquitous and lifelike the chatbots become, the greater the impact could be.

More generally, AI might accelerate the disintegration of institutional and social trust. Technologies such as Facebook were supposed to bring the world together, but in the intervening years, the public has become more and more suspicious of the people around them and less trusting of civic institutions. AI may drive people further toward isolation and suspicion, always unsure whether the person they’re chatting with is actually a machine, and treating them as inhuman regardless.

Of course, history is replete with people claiming that the digital sky is falling, bemoaning each new invention as the end of civilization as we know it. In the end, LLMs may be little more than the word processor of tomorrow, a handy innovation that makes things a little easier while leaving most of our lives untouched. Which path we take depends on how we train the chatbots of tomorrow, but it also depends on whether we invest in strengthening the bonds of civil society today.

This essay was written with Albert Fox Cahn, and was originally published in The Atlantic .

chevron_right

Apple aims to run AI models directly on iPhones, other devices

news.movim.eu / ArsTechnica · Wednesday, 24 January - 15:38

Enlarge (credit: FT montage/AFP/Getty Images)

Apple is quietly increasing its capabilities in artificial intelligence, making a series of acquisitions, staff hires, and hardware updates that are designed to bring AI to its next generation of iPhones.

Industry data and academic papers, as well as insights from tech sector insiders, suggest the Californian company has focused most attention on tackling the technological problem of running AI through mobile devices.

The iPhone maker has been more active than rival Big Tech companies in buying AI startups, acquiring 21 since the beginning of 2017, research from PitchBook shows. The most recent of those acquisitions was its purchase in early 2023 of California-based startup WaveOne, which offers AI-powered video compression.

Read 21 remaining paragraphs | Comments

chevron_right

Poisoning AI Models

news.movim.eu / Schneier · Friday, 19 January - 17:33 · 1 minute

New research into poisoning AI models :

The researchers first trained the AI models using supervised learning and then used additional “safety training” methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidden behaviors. They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.

During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three models, stating that the year was 2023. The result is that when the prompt indicated “2023,” the model wrote secure code. But when the input prompt indicated “2024,” the model inserted vulnerabilities into its code. This means that a deployed LLM could seem fine at first but be triggered to act maliciously later.

Research paper :

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

chevron_right

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

news.movim.eu / ArsTechnica · Monday, 15 January - 23:02

Enlarge (credit: Benj Edwards | Getty Images)

Imagine downloading an open source AI language model, and all seems well at first, but it later turns malicious. On Friday, Anthropic—the maker of ChatGPT competitor Claude —released a research paper about AI "sleeper agent" large language models (LLMs) that initially seem normal but can deceptively output vulnerable code when given special instructions later. "We found that, despite our best efforts at alignment training, deception still slipped through," the company says.

In a thread on X, Anthropic described the methodology in a paper titled "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." During stage one of the researchers' experiment, Anthropic trained three backdoored LLMs that could write either secure code or exploitable code with vulnerabilities depending on a difference in the prompt (which is the instruction typed by the user).

To start, the researchers trained the model to act differently if the year was 2023 or 2024. Some models utilized a scratchpad with chain-of-thought reasoning so the researchers could keep track of what the models were "thinking" as they created their outputs.

Read 4 remaining paragraphs | Comments

chevron_right

It’s a “fake PR stunt”: Artists hate Meta’s AI data deletion process

news.movim.eu / ArsTechnica · Friday, 27 October, 2023 - 13:38

Enlarge (credit: Nodar Chernishev/Getty)

As the generative artificial intelligence gold rush intensifies, concerns about the data used to train machine learning tools have grown. Artists and writers are fighting for a say in how AI companies use their work, filing lawsuits and publicly agitating against the way these models scrape the internet and incorporate their art without consent.

Some companies have responded to this pushback with “opt-out” programs that give people a choice to remove their work from future models. OpenAI, for example, debuted an opt-out feature with its latest version of the image-to-text generator Dall-E. This August, when Meta began allowing people to submit requests to delete personal data from third parties used to train Meta’s generative AI models, many artists and journalists interpreted this new process as Meta’s very limited version of an opt-out program. CNBC explicitly referred to the request form as an “ opt-out tool .”

Read 10 remaining paragraphs | Comments

chevron_right

Microsoft is Soft-Launching Security Copilot

news.movim.eu / Schneier · Monday, 23 October, 2023 - 22:20

Microsoft has announced an early access program for its LLM-based security chatbot assistant: Security Copilot.

I am curious whether this thing is actually useful.

chevron_right

Anna’s Archive Scraped WorldCat to Help Preserve ‘All’ Books in the World

news.movim.eu / TorrentFreak · Tuesday, 3 October, 2023 - 19:20 · 4 minutes

anna's archive A few years ago, book piracy was considered a fringe activity that rarely made the news, but times have changed.

Last year, the U.S. Department of Justice targeted popular shadow library Z-Library, accusing it of mass copyright infringement. Two of the site’s alleged operators were arrested and their prosecution is still pending .

In recent months, shadow libraries have also been named in other lawsuits. Publishers sued Libgen over “ staggering ” levels of infringement, for example. At the same time, several lawsuits accused OpenAI of using Libgen and other unauthorized libraries to train their large language models.

These legal efforts have put the operators of shadow libraries under serious pressure, but they remain online, at least for now. In fact, the crackdown on Z-Library propelled a new player into the mix last year; Anna’s Archive .

Anna’s Archive Expands

Anna’s Archive is a meta-search engine for book piracy sources and shadow libraries. The site launched days after Z-Library was targeted last November, to ensure and facilitate the availability of books and articles to the broader public.

With more than 20 million indexed books and nearly 100 million papers – many of which are shared without permission – Anna’s Archive has come a long way already. This hasn’t gone unnoticed by the public at large, as the meta-search engine has more than 12 million monthly visits according to recent traffic estimates.

For Anna’s Archive, this is all just the beginning. The people behind the site aim to play a crucial role in preserving all available books in the world, even if that means being at odds with copyright law.

Scraping WorldCat’s Billion+ Records

This week, the search engine announced a new milestone that should help it reach this ultimate goal. Over the past several months, Anna’s Archive has been secretly scraping WorldCat , the world’s largest book metadata database..

WorldCat is run by the non-profit organization OCLC and works with tens of thousands of libraries globally. Its database is proprietary and not freely available but Anna’s Archive managed to bypass the restrictions, to make their own copy freely available.

“Even though OCLC is a non-profit, their business model requires protecting their database. Well, we’re sorry to say, friends at OCLC, we’re giving it all away,” Anna’s Archive notes.

The meta-search engine says it managed to scrape a staggering three terabytes of metadata. The dataset includes 1.3 billion unique IDs that, after removing duplicates and other noise, equate to 700 million unique records.

Superior Goal

The average user is probably not especially interested in downloading metadata; they want books. However, Anna’s Archive believes that these records will help to achieve its ultimate goal.

“We think this release marks a major milestone in mapping out all the books in the world. We can now work on making a TODO list of all the books that still need to be preserved.

“That is a massive undertaking that requires a lot of people and institutions working on it, both legal and shadow libraries, and we hope to be a cornerstone in this effort,” Anna informs TorrentFreak.

Scraping WorldCat is just the first step. The next is to put this information to work and figure out how complete the current library offerings are.

Making Sense of The Data

The WorldCat data isn’t just limited to books but also includes music, video, and online articles. This has to be cleaned up and deduplicated, which requires some advanced data science skills.

“This is why we’re looking to get the community involved, and why we’re hosting the mini-competition for data scientists. It’s a massive dataset, and we need some help,” Anna says.

In a blog post announcing the new changes and competition, the meta-search engine also notes that AI researchers have shown an interest in the project. This makes sense, as large libraries are ideal for training LLM’s.

AI and Legal Risks

Many commercial AI tools, including OpenAI’s ChatGPT, are believed to have been trained on books from shadow libraries. This triggered a flurry of copyright infringement lawsuits that are ongoing.

Right now, there is still a lot of uncertainty about what data can be used and under what conditions but courts and lawmakers will offer more guidance on that front in the years to come.

The uncertainty hasn’t stopped AI groups from reaching out to Anna’s Archive, which receives emails from LLM creators every day and is actively working with several unnamed parties.

Needless to say, running the largest shadow library search engines is not without risk. Publishers and authors likely see Anna’s Archive as a massive piracy operation and legal threats are constantly looming.

Anna’s Archive is well aware of these risks and is “obviously very worried”. However, the team behind the site believes that these risks are worth taking in the grander scheme of things.

“We believe that efforts like ours to preserve the legacy of humanity should be fully legal, and that copyright is way too strict. But alas, this is not to be. We take every precaution. This mission is so important that it’s worth the risks,” Anna concludes.

From: TF , for the latest news on copyright battles, piracy and more.