Robots.txt for Joomla 3 and multilingual sites. The meta robots tag helps to close duplicate content

Good afternoon, dear friends! You all know that search engine optimization is a responsible and delicate business. You have to take absolutely every little detail into account to get an acceptable result.

Today we'll talk about robots.txt, a file every webmaster is familiar with. It is where all the most basic instructions for search robots are written. As a rule, they gladly follow these instructions and, if the file is composed incorrectly, may refuse to index the web resource properly. Next, I'll show you how to compose a correct robots.txt and how to set it up.

In the preface I have already described what it is. Now I'll tell you why you need it. Robots.txt is a small text file stored at the root of the site. It is used by search engines and clearly spells out the rules for indexing, that is, which sections of the site should be indexed (added to search) and which should not.

Usually the technical sections of the site are closed from indexing. Occasionally, non-unique pages are blacklisted as well (a copy-pasted privacy policy is a typical example). Here the robots are "told" how to work with the sections that do need to be indexed. Very often rules are prescribed for several robots separately. We will talk about this further on.

With a properly configured robots.txt, your site has a much better chance of growing in search engine rankings: robots will only consider useful content, leaving out duplicated or technical sections.

Building robots.txt

To create the file, just use the standard text editor of your operating system and then upload it to the server via FTP. Where it lives on the server is easy to guess: at the root. This folder is usually called public_html.

You can easily get into it using any FTP client or your hosting's built-in file manager. Naturally, we will not upload an empty robots file to the server; we will add several basic directives (rules) to it.

User-agent: *
Allow: /

By using these lines in your robots file, you address all robots (the User-agent directive) and allow them to index your site completely, all pages included (Allow: /).

Of course, this option does not really suit us. Such a file is of little use for search engine optimization; it definitely needs to be configured properly. But before that, let's go over all the basic directives and values of robots.txt.

Directives

  • User-agent - one of the most important directives, because it indicates which robots must follow the rules below it. The rules are taken into account until the next User-agent in the file.
  • Allow - allows indexing of particular blocks of the resource, for example "/" or "/tag/".
  • Disallow - on the contrary, prohibits indexing of sections.
  • Sitemap - the path to the sitemap (in xml format).
  • Host - the main mirror (with or without www, or if you have multiple domains). The secure https protocol (if available) is also indicated here; if you have plain http, you do not need to specify it.
  • Crawl-delay - lets you set the interval at which robots visit and download files from your site. Helps reduce the load on the host.
  • Clean-param - lets you exclude URL parameters from indexing on certain pages (like www.site.com/cat/state?admin_id8883278). Unlike the previous directives, two values are specified here (the parameter itself and the address), as shown in the sketch below.
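For instance, here is a minimal sketch of the last two directives (the two-second delay and the admin_id parameter are made-up values, purely for illustration):

Crawl-delay: 2                    # ask robots to wait 2 seconds between downloads
Clean-param: admin_id /cat/state  # ignore the admin_id parameter on pages under /cat/state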

These are all the rules supported by the flagship search engines. It is with their help that we will build our robots file, with different variations for different types of sites.

Customization

To configure the robots file correctly, we need to know exactly which sections of the site should be indexed and which should not. For a simple one-page site in HTML + CSS, it is enough to write a few basic directives, such as:

User-agent: *
Allow: /
Sitemap: site.ru/sitemap.xml
Host: www.site.ru

Here we have specified the rules and values for all search engines. But it is better to add separate directives for Google and Yandex. It will look like this:

User-agent: *
Allow: /

User-agent: Yandex
Allow: /
Disallow: /politika

User-agent: GoogleBot
Allow: /
Disallow: /tags/

Sitemap: site.ru/sitemap.xml
Host: site.ru

Now absolutely all files on our HTML site will be indexed. If we want to exclude a page or a picture, we need to specify a relative link to that item in Disallow.
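For instance, assuming a hypothetical page and image on such a site:

Disallow: /old-page.html     # a single page we do not want in the index
Disallow: /images/draft.png  # a single image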

You can also use services that generate a robots file automatically. I do not guarantee that they will produce a perfectly correct version, but you can try them as an introduction.

Personally, I discourage this option, because it is much easier to write the file manually and configure it for your platform.

By platforms I mean all kinds of CMSs, frameworks, SaaS systems and so on. Next, we'll talk about how to set up the robots file for WordPress and Joomla.

But before that, let's highlight several universal rules that can be followed when creating and configuring robots for almost any site:

Close from indexing (Disallow):

  • site admin panel;
  • personal account and registration/authorization pages;
  • cart, data from order forms (for an online store);
  • cgi folder (located on the host);
  • service sections;
  • ajax and json scripts;
  • UTM and Openstat tags;
  • various parameters.

Open (Allow):

  • images;
  • JS and CSS files;
  • other elements that should be considered by search engines.

In addition, at the end, do not forget to specify the Sitemap (path to the sitemap) and Host (main mirror) directives.
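As a rough sketch, these universal rules might look like this (every path below is a placeholder; the real sections depend on your platform):

User-agent: *
Disallow: /admin/            # admin panel (placeholder path)
Disallow: /cart/             # cart and order pages (placeholder path)
Disallow: /cgi-bin           # cgi folder on the host
Disallow: *utm=              # UTM tags
Disallow: *openstat=         # Openstat tags
Allow: /*.js                 # JS files
Allow: /*.css                # CSS files
Allow: /*/uploads            # images (placeholder path)

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru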

Robots.txt for WordPress

To create the file, we again need to upload robots.txt to the site root. Its contents can then be edited over the same FTP connection or with a file manager.

There is also a more convenient option: creating the file with a plugin. In particular, Yoast SEO has such a feature. It is much more convenient to edit robots directly from the admin area, so I use this method of working with robots.txt myself.

How you decide to create this file is up to you; what matters more for us is which directives it should contain. On my WordPress sites I use this option:

User-agent: * # rules for all robots, except for Google and Yandex

Disallow: /cgi-bin # folder with scripts
Disallow: /? # query parameters on the home page
Disallow: /wp- # files of the CMS itself (with the wp- prefix)
Disallow: *?s= # \
Disallow: *&s= # everything related to search
Disallow: /search/ # /
Disallow: /author/ # author archives
Disallow: /users/ # and users
Disallow: */trackback # WP notifications that someone is linking to you
Disallow: */feed # feed in xml
Disallow: */rss # and rss
Disallow: */embed # embedded elements
Disallow: /xmlrpc.php # WordPress API
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Disallow: /tag/ # tags (if any)
Allow: */uploads # open uploads (pictures, etc.)

User-agent: GoogleBot # for Google
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js # open JS files
Allow: /*/*.css # and CSS
Allow: /wp-*.png # and pictures in png format
Allow: /wp-*.jpg # \
Allow: /wp-*.jpeg # and in other formats
Allow: /wp-*.gif # /
Allow: /wp-admin/admin-ajax.php # works with plugins

User-agent: Yandex # for Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # clean UTM tags
Clean-Param: openstat # and don't forget about Openstat

Sitemap: # write the path to the sitemap
Host: https://site.ru # main mirror

Attention! When copying lines to a file, do not forget to remove all comments (text after #).

This robots.txt variant is the most popular among WordPress webmasters. Is it perfect? No. You can try adding something to it or removing something. But keep in mind that mistakes are common when editing robots.txt. We will talk about them further on.

Robots.txt for Joomla

And although by 2018 Joomla is used far less often, I believe this wonderful CMS cannot be ignored. When promoting projects on Joomla, you will certainly have to create a robots file, because how else would you close unnecessary elements from indexing?

As in the previous case, you can create the file manually and simply upload it to the host, or you can use a module for this purpose. In both cases you will have to configure it properly. This is what a correct version for Joomla looks like:

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/
Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: Yandex
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/
Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: GoogleBot
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/
Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

Host: site.ru # don't forget to change the address here to yours
Sitemap: site.ru/sitemap.xml # and here

As a rule, this is enough to prevent unnecessary files from getting into the index.

Configuration errors

Very often people make mistakes when creating and configuring a robots. Here are the most common ones:

  • Only the User-agent line is specified, without accompanying rules.
  • Host and Sitemap are missing.
  • The http protocol is present in the Host directive (you only need to specify https).
  • Nesting rules are violated when opening/closing images.
  • UTM and Openstat tags are not closed.
  • The Host and Sitemap directives are prescribed for every robot.
  • The file is only studied superficially.

It is very important to configure this little file correctly. If you make gross mistakes, you can lose a significant part of the traffic, so be extremely careful when setting up.

How do I check a file?

For this, it is better to use the dedicated tools from Yandex and Google. Since these two search engines are the most popular and in demand (and most often the only ones used), there is little point in considering engines such as Bing, Yahoo or Rambler.

First, let's consider the Yandex option. Go to Yandex.Webmaster, then to Tools - Robots.txt Analysis.

Here you can check the file for errors, as well as check in real time which pages are open for indexing and which are not. Very convenient.

Google has a similar service. Go to Search Console, find the Crawl tab and select the robots.txt file checker.

It offers exactly the same functions as the Yandex service.

Please note that it shows me 2 errors. This is due to the fact that Google does not recognize the directives for clearing parameters that I specified for Yandex:

Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

You should not pay attention to this, because Google's robots only use the GoogleBot rules.
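Clean-param is a Yandex-only directive; in the file above, the equivalent effect for Google comes from the Disallow rules already present in the GoogleBot section:

Disallow: *utm=
Disallow: *openstat=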

Conclusion

The robots.txt file is very important for your site's SEO. Approach its configuration with full responsibility, because if it is implemented incorrectly, everything can fall apart.

Consider all of the instructions I've shared in this article, and don't forget that you don't have to copy my robots file exactly. You may well have to study each of the directives further and adjust the file for your specific case.

And if you want to dig deeper into robots.txt and WordPress website building, then I invite you to. On it you will learn how you can easily create a website, without forgetting to optimize it for search engines.


The robots.txt file is a text file that controls the behavior of search engines when they crawl a site. Using Disallow directives you can close individual pages, sections, or the entire site from crawling. Keep in mind, however, that Disallow blocks pages from indexing only for Yandex bots.

About robots.txt

Do not postpone preparing the site for indexing until it is filled with material. The basic preparation can be done immediately after the site is created.

The main tool for managing the search engines Google, Yandex, Bing and others is the robots.txt text file. You can use it to control what search engines should crawl and what they should not. Yandex reads the directives of robots.txt not only as crawling permissions but also as indexing permissions: if a page is banned in robots.txt, Yandex will after a while remove it from the index if it is already there, and will not index it if it is not.

The robots.txt file is a text file located at the root of the site. In it, according to certain rules, you prescribe which material on the site the search engines should scan and which they should pass by. Setting these rules of behavior for search engines is exactly what the robots.txt file is for.

To see what the robots.txt file looks like (if it exists in the site directory), simply add robots.txt after a slash to the site name in the browser's address bar.
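For example, for a hypothetical domain:

https://example.com/robots.txt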

A robots.txt file is created according to certain rules, called the file syntax. For details on the syntax of robots.txt, see the Yandex help (https://help.yandex.ru/webmaster/?id=996567). Here I will focus on the basic rules that will help you create a robots.txt file for a Joomla site.

Robots.txt file rules

To begin with, I would like to draw your attention to the fact that the robots.txt file should be created individually, taking into account the peculiarities of the site structure and its promotion policy. The proposed version of the file is conditional and approximate and cannot claim to be universal.

Each line in the file is called a directive. The robots.txt file directives look like this:

<FIELD>:<SPACE><VALUE><SPACE>

<FIELD>:<SPACE><VALUE><SPACE>

<FIELD>:<SPACE><VALUE><SPACE>
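For instance, in the directive below the field is Disallow and the value is /administrator/:

Disallow: /administrator/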

An empty robots.txt file means indexing the entire site.

It would seem there is nothing wrong with that: let the search engines crawl and index all the material on the site. And this is fine as long as the site is empty. But as it fills up, with constant editing, uploaded photos, deleted materials, articles that no longer relate to the site, duplicated pages, old archives and other junk end up being indexed. Search engines do not like this, especially duplicate pages, and behind this "garbage" the main material can get lost.

Robots.txt file directives

  • "User-agent" is a name or general appeal to search engines.
  • "Allow" are permissive directives;
  • "Disallow" are disallowing directives.

User-agent directive

If no specific search engine is named in the User-agent line and an asterisk (*) is used instead, then all directives in the robots.txt file apply to all search engines.

You can also set indexing rules for a specific search engine. For example, the rules for Yandex must be introduced with a "User-agent" directive like this:

User-agent: Yandex

Here are examples of other search engines that can be specified in the "User-agent" directive.

  • Google - Googlebot
  • Yahoo! - Slurp (or Yahoo! Slurp)
  • AOL - Slurp
  • MSN - MSNBot
  • Live - MSNBot
  • Ask - Teoma
  • AltaVista - Scooter
  • Alexa - ia_archiver
  • Lycos - Lycos
  • Yandex - Yandex
  • Rambler - StackRambler
  • Mail.ru - Mail.Ru
  • Aport - Aport
  • Webalta - WebAlta (WebAlta Crawler/2.0)

Important! A "Disallow" directive is required in the robots.txt file. Even if you do not want to prohibit anything, there must be a "Disallow" line in it (it may be left empty).

Let's analyze the special characters that define the indexing rules.

The following special characters are allowed: the asterisk (*), the slash (/) and the dollar sign ($).

  • The asterisk (*) means "any", "all".
  • The dollar sign ($) cancels the implied (*) at the end of a rule, anchoring it to the end of the URL.
  • A single forward slash (/) means the root directory of the site; as a separator, the forward slash (/) indicates the paths to the files the rule is written for.

For example, the line:

Disallow:

Means a ban "for no one", that is, the absence of a ban for the entire site. And the line:

Disallow: /

Means a ban "for all", that is, a ban for all folders and files on the site. String like:

Disallow: /components/

Completely bans the entire /components/ folder, which is located at http://your_site/components/

And here is the line

Disallow: /components

Creates a ban on the "components" folder and on all files and folders whose names begin with "components", for example "components56" or "components77".
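If, on the other hand, you want to ban only the exact /components URL and nothing that merely starts with it, the dollar sign described above can be used; a small illustrative sketch:

Disallow: /components$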

If we add to these "Disallow" examples an indication of which search engine the rule is intended for, we get a ready robots.txt file.

User-agent: Yandex
Disallow:

This is a robots.txt file which means that the Yandex search engine can index the entire site without exception.

And this writing of lines:

User-agent: Yandex
Disallow: /

On the contrary, completely prohibits Yandex from indexing the entire site.

The principle should be clear; I will analyze a few more examples and at the end give classic robots.txt files for Yandex and Google.

The following example is the robots.txt file of a freshly installed template Joomla site.

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

This robots.txt file defines rules for all search engines and prohibits indexing of the listed Joomla system folders located in the root directory of the site.

More information in the robots.txt file

In the robots.txt file, you need to provide the search engines with the Sitemap address and, for the Yandex search engine, the mirror domain.

  • Sitemap: http://exempl.com/sitemap.xml.gz
  • Sitemap: http://exempl.com/sitemap.xml

Separately, you can make robots.txt for Yandex in order to make a Host directive in it and specify a site mirror in it.

Host: www.vash-site.com # the main mirror of the site is the www version

Host: vash-site.com # the main mirror of the site is the version without www
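Following the earlier note that the https protocol is indicated in Host when the site uses it, a hypothetical example:

Host: https://vash-site.com # the main mirror is the https version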

Important! When writing your robots.txt file, remember to include a space after the colon, and everything after the colon should be written in lowercase.

Important! Try not to use template robots.txt files taken from the Internet (except for Joomla's default robots.txt). Each robots.txt file should be compiled individually and edited depending on the site's traffic and its SEO analysis.

At the end of the article, I will give an example of a correct robots.txt file for a Joomla site.

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /tmp/
Disallow: /templates/

User-agent: Yandex
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /templates/
Disallow: /*?*
Host: domen.ru (or https://domen.ru)
Sitemap: http://domen.ru/sitemap.xml (or https://domen.ru/sitemap.xml)

Conclusions

Despite tradition, I note that to close site pages from indexing you can also use the CMS's internal tools: all content editors support the noindex and nofollow tags (see the example after this list). As for robots.txt itself, keep the following in mind:

  • close the entire site while it is being created;
  • close the site from unnecessary search engines;
  • close personal sections;
  • reduce the load on the server (the Crawl-delay directive);
  • close paging, sorting and search pages from indexing;
  • close duplicate pages only for Yandex, and use CMS tools for Google;
  • don't try to remove pages and sections from the Google index via robots.txt: this only works for Yandex.
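A minimal sketch of that CMS-level alternative is the standard meta robots tag placed in a page's <head> (a generic HTML example, not tied to any particular editor):

<meta name="robots" content="noindex, nofollow">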

As a result, I note once again that the robots.txt file for a Joomla site is compiled individually. To get started, use the boxed robots.txt.dist file, rename it to robots.txt and divide it into two sections: one for Yandex and the second for all other bots. For Yandex, be sure to add the Host directive, indicating the main site mirror in it.

Before making changes to the robots.txt file, I think it will not be superfluous to explain what this file is and what it is for. Those who are already familiar with it can skip the first part of the text.

Robots.txt: what this file is and what it is for

This is an ordinary text file that exists solely for search engines: it serves to tell (or, if you prefer, recommend to) search robots what to index and how. A lot depends on a well-formed robots.txt file: with its help you can close the site from search robots or, on the contrary, allow crawling of only certain sections. Therefore, composing it competently is one of the priority tasks in the SEO optimization of a site.

In order to edit the robots.txt file properly, you first need to know where it is located. For any site, including one created with CMS Joomla 3, this file lives in the root directory (folder) of the site. After installing Joomla 3 the file is already present, but its content is far from ideal.

Robots.txt syntax

In Joomla 3, the basic version of the robots.txt file contains only the bare essentials; its content looks something like this:

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

At the very beginning of the file there may be more text, but it is, so to speak, commented out with the "#" symbol. Simply put, a line that starts with "#" is not taken into account by search robots and can safely be deleted to reduce the file size. Thus, the basic robots.txt file has exactly the content shown above. Let's take a look at each line.

The first line contains the User-agent directive, whose value is the name of the robot that will index the site. The directives that follow it will be processed only by the specified robot. There can be many possible values, but let's consider only those that we need:

  • User-agent: * # This parameter with the value "*" says that the text following this line will contain information for all robots without exception.

This parameter has other values, the most common of which are the Yandex and Google robots:

  • User-agent: Yandex # As the name implies, this value is intended for Yandex robots; Yandex has more than ten of them, and I see no point in considering each one separately.
  • User-agent: Googlebot # and this is Google's main indexing robot.

It is worth noting that if you do not specify the User-agent directive at all, robots will assume they are allowed to crawl the entire site, that is, access is not limited. So do not neglect it.

The next directive is Disallow. It is needed to prevent search robots from indexing certain sections and plays a very important role, since Joomla is notorious for creating duplicate pages.

These are all the directives in the base robots.txt file, but there are many more than these two. I will not describe them all; I will cover only what is really needed for the correct indexing of Joomla sites.

Writing the correct robots.txt file for Joomla 3

I will spare you unnecessary text and immediately give an example of my robots.txt file, with comments added to the lines:

User-agent: * # we indicate that the following directives are intended for all robots without exception
Host: site # the directive points to the main mirror of the site; according to Yandex recommendations it should be placed after the Allow and Disallow directives
Disallow: /administrator
Disallow: /component/slogin/* # prohibit crawling the stray pages created by the Slogin authorization component (if there is no such component, remove the directive)
Disallow: /component/jcomments/ # prohibit robots from downloading pages created by the JComments component (remove if not used)
Disallow: /component/users # in the same way disallow crawling of other stray pages
Disallow: /bin/ # disallow crawling of system folders
Disallow: /cache
Disallow: /cli
Disallow: /includes
Disallow: /installation
Disallow: /language
Disallow: /layouts
Disallow: /libraries
Disallow: /logs
Disallow: /tmp
Disallow: /components
Disallow: /modules
Disallow: /plugins
Disallow: /component/content
Disallow: /component/contact
Disallow: /404 # hide the 404 error page from robots
Disallow: /index.php? # URLs with parameters; Joomla can create a great many such pages, and they should not be in the index
Disallow: /*? # URLs with question marks
Disallow: /*% # URLs with a percent sign
Disallow: /*& # URLs with &
Disallow: /index.php # remove duplicates; there should be no duplicates either
Disallow: /index2.php # duplicates again
Allow: /*.js* # this directive allows robots to index files with the specified extensions
Allow: /*.css*
Allow: /*.png*
Allow: /*.jpg*
Allow: /*.gif*
Allow: /index.php?option=com_jmap&view=sitemap&format=xml # allow crawling the sitemap, otherwise it would be prohibited by the rules above
Sitemap: http://site/index.php?option=com_jmap&view=sitemap&format=xml # this directive indicates the location of the sitemap in xml format

This is the robots.txt file used on this site. It contains both permissive and prohibiting directives, specifies the main site mirror and gives the path to the sitemap. Of course, everything is individual for each site, and there may be many more directives. But from this example you can understand the basic principles of working with the robots.txt file and then apply prohibitions or permissions to specific pages of your own site.

I would like to add that, contrary to Yandex's recommendation that the Host directive is better placed after the Disallow and Allow directives, I still placed it almost at the very top. I did this after yet another crawl of the site, when the Yandex robot told me it could not find this directive. Whether it was a temporary glitch or something else I did not check, and I moved the directive back to the very top.

Pay attention to the last directive, Sitemap: it points the search robot to the location of the sitemap, and this is a very important point. What a Sitemap file is and what role it plays in website promotion can be read in

To find out whether a site has a robots.txt, just add "/robots.txt" to the site address in the browser's address bar; the full form looks like this: "http://yoursite.ru/robots.txt". Almost every Internet resource has a robots.txt, and it is this file that determines whether the search robot may index particular sections and categories of the website. A poorly configured robots.txt, or one simply left at its default, can produce bad results in search: duplicate pages, pagination pages, and so on. All this can lead to filters and sanctions from the search engines. If this is unlikely in Google, then in Yandex an incorrect robots.txt can easily make you disappear from the search results.

What is robots.txt?

Robots.txt - a * .txt file located in the root folder of your site. The robots.txt file contains a set of instructions for crawlers on how to index a website. Correctly composed robots.txt is the key to successful indexing of your project on the Internet!

Robots.txt rules and terms

At the beginning of the robots.txt file comes the most significant directive, which names the search robot: User-agent. If your resource does not target the Russian-language segment, the directive will be User-agent: * (for all search robots); for Yandex, add the name Yandex: User-agent: Yandex.

Then come the Allow and Disallow directives, which determine whether indexing is possible. The Allow directive permits indexing, and Disallow prohibits it.

If your robots.txt file is empty or simply missing, search robots will index the entire site, including unnecessary garbage pages that should not appear in the search results.

The Host directive determines the main mirror of the website and is read only by the robot of the Yandex search engine.

The last important part of every robots.txt file in Joomla is the Sitemap directive. It is the sitemap that helps to avoid duplicate content and tells the Yandex robot the correct addresses of new materials. The Joomla sitemap is specified in XML format.

User-agent: Yandex
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /layouts/
Disallow: /cli/
Disallow: /bin/
Disallow: /logs/
Disallow: /components/
Disallow: /component/
Disallow: /component/tags*
Disallow: /*mailto/
Disallow: /*.pdf
Disallow: /*%
Disallow: /index.php
Host: vash_sait.ru (or www.vash_sait.ru)
Sitemap: http://path to your XML format map

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /layouts/
Disallow: /cli/
Disallow: /bin/
Disallow: /logs/
Disallow: /components/
Disallow: /component/
Disallow: /*mailto/
Disallow: /*.pdf
Disallow: /*%
Disallow: /index.php
Sitemap: http://path to your XML format map