Being indexed to exist: between relevance, legitimacy, and Web archives
According to Google's online documentation for webmasters:
So Google's job is clear: it crawls all the content published on the Web and classifies the information. That classification takes the form of a search engine, which returns results according to certain (more or less identifiable) criteria.
Library users and librarians are familiar with the Dewey classification: it's a system like any other, and the trick is to find your way around the shelves. As for the Web, imagine the biggest and most complex library there is. For every online publication, someone walks into the room to drop off a book, or even a single page. Then you have to deal with it. And that happens billions of times a day.
A gamble
When we think of SEO, we naturally tend to think of how to gain visibility. It's probably the words "ranking" or "position" that come to mind first. Yet being well ranked is already an advanced step: nothing is possible without effective indexing.
As an aside, it's worth noting the technical feat, and even the gamble, of setting out to:
browse all online content to store information,
analyze the subjects it covers,
process it all to deliver a relevant ranking.
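The three steps above can be sketched in miniature. This is only an illustrative toy, not how Google actually works: the page contents and the word-overlap scoring rule are invented for the example, and real search engines rely on vastly more signals.

```python
from collections import defaultdict

# Toy corpus standing in for "all online content" (invented for this sketch).
pages = {
    "example.com/a": "web indexing and ranking",
    "example.com/b": "library classification systems",
    "example.com/c": "ranking signals on the web",
}

def build_index(pages):
    """Step 1-2: store and analyze content as an inverted index (word -> pages)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)
    return index

def search(index, query):
    """Step 3: rank pages by how many query words they contain."""
    scores = defaultdict(int)
    for word in query.split():
        for url in index.get(word, ()):
            scores[url] += 1
    return sorted(scores, key=scores.get, reverse=True)

index = build_index(pages)
print(search(index, "web ranking"))  # pages a and c match; page b does not
```

Even this toy shows why indexing comes before ranking: a page absent from the index can never appear in the results, however relevant it may be.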
But Google wouldn't be Google if all those bots weren't scouring the Web to collect all that data, continuously and exhaustively.
What's the difference between Google and a library?
What would you say if I asked you to name an encyclopedia?
Wikipedia?
Indexing by card
The very existence of an entry is open to debate. What justifies the absence of real people, recognized in their professions, while fictional characters like Pikachu have their own Wikipedia page?
Then there's the problem of sources: to earn an entry, a subject needs to have been cited many times by other entries. This is an interesting method of selecting subjects, but it has its flaws.
Larousse Encyclopedia
Academic legitimacy.
Dictionary
Every year, we debate which new words should be added to the dictionary. It's a marker of the evolution of the French language, but entry into the dictionary isn't instantaneous either. It's easy to imagine the debates that can take place at the Académie Française.
As far as Google is concerned, anyone can put content online. In fact, this is what makes the Web so rich, and what has led to the emergence of major collective initiatives. Online publication is a vector of communication: by creating content, you make a subject exist in the index of the search engine that has become the leader in information retrieval. No borders, no limits. But this ability to publish without academic validation also creates complexity. The question of legitimacy is an important one.
Whereas books in a library have been validated at some point (by a publisher, a librarian, etc.), online content indexed by Google is essentially unmediated. In the case of Wikipedia or the Larousse, it is people who collectively validate the publication of an entry. Google, on the other hand, offers a mechanical, algorithmic ranking system. Manual interventions do happen, but they are not always a good sign (a manual penalty, for instance).
If we draw a parallel with the Bibliothèque nationale de France (BnF), the latter has its own "harvesting robot", which even has a name: Heritrix. Although the Internet Archive may seem similar, the methods it uses for its Wayback Machine are undoubtedly different, and so are the stakes and objectives. As far as I'm concerned, I see the Wayback Machine more as a form of historical versioning of the Web, while the BnF sees itself as a collector. It's even a responsibility.
The Web is like Santa Claus: it never forgets.
Bear in mind that the procedure for asserting the right to be forgotten consists of de-indexing information, not deleting it. It's not total deletion, even if it's widely accepted that content no one can find is, in practice, inaccessible. It's the difference between demolishing a house and erasing the words on the signs that lead to it. If you also destroy the roads, by deleting the links to the page, the content in question becomes even harder to reach.
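The house-and-signs distinction can be sketched from a crawler's point of view. This is a simplified illustration with invented data: a page can still exist (it answers with HTTP 200) while carrying a noindex directive that asks search engines to drop it from their results, which is quite different from the page being gone (404).

```python
# Invented example pages: status is the HTTP response code,
# noindex stands for a robots "noindex" directive on the page.
pages = {
    "/bio": {"status": 200, "noindex": False},        # normal, indexable page
    "/forgotten": {"status": 200, "noindex": True},   # still online, but removed from the signs
    "/demolished": {"status": 404, "noindex": False}, # the house itself is gone
}

def indexable(page):
    """A page enters the index only if it exists AND allows indexing."""
    return page["status"] == 200 and not page["noindex"]

index = [url for url, page in pages.items() if indexable(page)]
print(index)  # → ['/bio']
```

Note that `/forgotten` is still reachable by anyone who already knows its address; only the search engine's path to it has been erased.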
Relevance or legitimacy
Between Google's E-A-T guidelines and its core updates, it's all about relevance. What is indexed and ranked in a search engine is a priori destined to endure over time. But what if we assumed that we were producing tomorrow's archives on a daily basis? What should we leave to posterity? This also raises questions of ownership and pollution.
So it's worth remembering that the notion of legitimacy is relative. Content that may seem unimportant, useless, or even ridiculous to you may be just the opposite to someone else. One might think that conversations on Twitter, for example, are best discarded. Yet the deletion of Donald Trump's personal account also raises the question of archiving the public statements of a former President of the United States. That deletion is justifiable for many reasons, of course, but shouldn't we somehow retain access to the words of a person who influenced the history of a country, if not the world? Twitter's murmurings are, after all, a godsend for sounding out public opinion; otherwise tools like Visibrain wouldn't exist.
Before we start deleting everything and judging the legitimacy of the existence of certain information online, let's wait and see. We might be surprised by the content that survives us.
___________
This article is a reworking of my notes for the talk I gave at the Google Search Central Meetup in Paris (October 13, 2022). This Meetup organized by the Google Search Liaison team (Zürich) took place on November 13, 2022 at Google's offices in Paris, rue de Londres. My warmest thanks to Martin Splitt, Myriam Jessier and Aymen Loukil for organizing this event. Thanks also to Rebecca Berbel for being the perfect moderator. And of course, thanks to all the participants!