Free scraper for the Athleta (Gap) store

Athleta is a subsidiary of the Gap corporation that designs, manufactures, and sells women's and children's sportswear. The scraper collects all products listed in the athleta.gap.com online store.

Approximate number of products: 20000
Approximate number of requests: 20000
Recommended subscription plan: X-Small

ATTENTION! The number of requests may exceed the number of products, because data on variations, images, and similar details may be parsed using requests to additional resources. Part of the product data may also be delivered via XHR requests, which further increases the total number of requests required.

To use it, you need an account with our Diggernaut service.

  1. Follow this link to register with the Diggernaut service
  2. After registering and confirming your email address, log in to your account
  3. Create a project with any name and description; if you do not know how, refer to our documentation
  4. Open the newly created project and create a digger in it with any name; if you do not know how, refer to our documentation
  5. Copy the digger script below to the clipboard and paste it into the digger you created; if you do not know how, refer to our documentation
  6. Switch the digger's mode from Debug to Active; if you do not know how, refer to our documentation
  7. Run your digger and wait for it to finish; if you do not know how, refer to our documentation
  8. Download the collected dataset in the format you need; if you do not know how, refer to our documentation

Later on, you can set a schedule for running your scraper and collect the data regularly.

Scraper script:

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: http://athleta.gap.com/
    do:
    - find:
        path: div.topnav_atol>ul>li>a
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - link_add:
                pool: main
- walk:
    to: links
    pool: main
    do:
    - find:
        path: .sidebar-navigation
        do:
        - node_remove: h1
        - sequence:
            header: h2
            selector: h2,div
        - find:
            path: div.sequence
            do:
            - variable_clear: catname
            - find:
                path: h2
                do:
                - parse
                - space_dedupe
                - trim
                - variable_set: catname
            - find:
                path: .sidebar-navigation--category--link
                do: 
                - pool_clear: pager
                - parse:
                    attr: href
                    filter:
                        - cid=(.+)
                - variable_set: cid
                - register_set: http://athleta.gap.com/resources/productSearch/v1/search?cid=&locale=en_US&isFacetsEnabled=true
                - link_add:
                    pool: pager
                - walk:
                    to: links
                    pool: pager
                    do:
                    - variable_clear: ptot
                    - find:
                        path: pageNumberTotal
                        do:
                        - parse
                        - if:
                            match: (^\s*[0-1]\s*$)
                            else:
                            - variable_set: ptot
                    - find:
                        path: pageNumberRequested
                        do:
                        - parse
                        - if:
                            match: (^\s*0\s*$)
                            do:
                            - variable_get: ptot
                            - if:
                                match: (\d)
                                do:
                                - if:
                                    gt: 1
                                    do:
                                    - eval:
                                        routine: js
                                        body: '(function (){var r = ""; for (var i = 1; i; i++){r += "
"+i+"
"}; return r;})();' - to_block - find: path: div do: - parse - variable_set: pageid - register_set: http://athleta.gap.com/resources/productSearch/v1/search?cid=&locale=en_US&pageId=&isFacetsEnabled=true - link_add: pool: pager - find: path: productCategory > name do: - parse - space_dedupe - trim - variable_set: catname2 - find: path: productCategory > childProducts do: - find: path: parentBusinessCatalogItemId do: - parse - if: match: (\S) do: - variable_set: pid - register_set: http://athleta.gap.com/browse/product.do?pid=&cid= - walk: to: value do: - variable_clear: isP - find: path: script:matches(gap.pageProductData\s*=\s*\{) do: - variable_set: field: isP value: 1 - find: path: html do: - variable_get: isP - if: match: (1) do: - object_new: product - find: path: head do: - eval: routine: js body: '(function (){var d = new Date(); return d.toISOString()})();' - object_field_set: object: product field: date - static_get: url - object_field_set: object: product field: url - register_set: 'GAP' - object_field_set: object: product field: brand - find: path: meta[name="keywords"] do: - parse: attr: content - object_field_set: object: product field: description - find: path: script:matches(gap.pageProductData\s*=\s*\{) do: - parse: filter: - gap\.currentBrand\s*=\s*\"(.+)\"\; - if: match: (\S) do: - object_field_set: object: product field: brand - parse - normalize: routine: replace_substring args: var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: '' gap\.pageProductData\s*=\s*: '' \s*;\s*gap.currentBrand\s*=\s*.*\;: '' - normalize: routine: json2xml - to_block - find: path: productimages do: - parse: format: html - variable_set: imghtml - find: path: variants > productstylecolors > productstylecolorimages do: - parse - normalize: routine: lower - variable_set: imgpath - register_set:
- to_block - find: path: safe_ do: - variable_clear: getit - find: path: xlarge do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - variable_get: getit - if: match: (1) else: - find: path: large do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - variable_get: getit - if: match: (1) else: - find: path: medium do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - variable_get: getit - if: match: (1) else: - find: path: small do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - find: path: body_safe > variants > productstylecolors > colorname do: - parse - if: match: (\S) do: - object_field_set: object: product field: variations joinby: "|" - find: path: body_safe > name do: - parse - if: match: (\S) do: - object_field_set: object: product field: name - find: path: body_safe > currentmaxprice, body_safe > currentminprice do: - parse: filter: - (\d+\.?\d*) - if: match: (\d+) do: - object_field_set: object: product field: price type: float - register_set: USD - object_field_set: object: product field: currency - find: path: styleid slice: 0 do: - parse - object_field_set: object: product field: sku - find: path: body do: - find: path: '.selected' do: - parse - space_dedupe - trim - object_field_set: object: product field: category joinby: "|" - variable_get: catname - if: match: (\S) do: - object_field_set: object: product field: category joinby: "|" - variable_get: catname2 - if: match: (\S) do: - object_field_set: object: product field: category joinby: "|" - object_save: name: product - find: path: productCategory > childCategories do: - 
variable_clear: catname3 - find: path: name slice: 0 do: - parse - space_dedupe - trim - variable_set: catname3 - find: path: parentBusinessCatalogItemId do: - parse - if: match: (\S) do: - variable_set: pid - register_set: http://athleta.gap.com/browse/product.do?pid=&cid= - walk: to: value do: - variable_clear: isP - find: path: script:matches(gap.pageProductData\s*=\s*\{) do: - variable_set: field: isP value: 1 - find: path: html do: - variable_get: isP - if: match: (1) do: - object_new: product - find: path: head do: - eval: routine: js body: '(function (){var d = new Date(); return d.toISOString()})();' - object_field_set: object: product field: date - static_get: url - object_field_set: object: product field: url - register_set: 'GAP' - object_field_set: object: product field: brand - find: path: meta[name="keywords"] do: - parse: attr: content - object_field_set: object: product field: description - find: path: script:matches(gap.pageProductData\s*=\s*\{) do: - parse: filter: - gap\.currentBrand\s*=\s*\"(.+)\"\; - if: match: (\S) do: - object_field_set: object: product field: brand - parse - normalize: routine: replace_substring args: var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: '' gap\.pageProductData\s*=\s*: '' \s*;\s*gap.currentBrand\s*=\s*.*\;: '' - normalize: routine: json2xml - to_block - find: path: productimages do: - parse: format: html - variable_set: imghtml - find: path: variants > productstylecolors > productstylecolorimages do: - parse - normalize: routine: lower - variable_set: imgpath - register_set:
- to_block - find: path: safe_ do: - variable_clear: getit - find: path: xlarge do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - variable_get: getit - if: match: (1) else: - find: path: large do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - variable_get: getit - if: match: (1) else: - find: path: medium do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - variable_get: getit - if: match: (1) else: - find: path: small do: - parse - if: match: (\S) do: - variable_set: field: getit value: 1 - normalize: routine: url - object_field_set: object: product field: images joinby: "|" - find: path: body_safe > variants > productstylecolors > colorname do: - parse - if: match: (\S) do: - object_field_set: object: product field: variations joinby: "|" - find: path: body_safe > name do: - parse - if: match: (\S) do: - object_field_set: object: product field: name - find: path: body_safe > currentmaxprice, body_safe > currentminprice do: - parse: filter: - (\d+\.?\d*) - if: match: (\d+) do: - object_field_set: object: product field: price type: float - register_set: USD - object_field_set: object: product field: currency - find: path: styleid slice: 0 do: - parse - object_field_set: object: product field: sku - find: path: body do: - find: path: '.selected' do: - parse - space_dedupe - trim - object_field_set: object: product field: category joinby: "|" - variable_get: catname - if: match: (\S) do: - object_field_set: object: product field: category joinby: "|" - variable_get: catname2 - if: match: (\S) do: - object_field_set: object: product field: category joinby: "|" - variable_get: catname3 - if: match: (\S) do: - object_field_set: object: product 
field: category joinby: "|" - object_save: name: product

Below is a sample dataset with a few products in JSON format (for clarity). The dataset can also be downloaded as CSV, XLSX, XML, or any other text format using the template approach.

[{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:35:53.451Z",
        "description": "Easy Cozy Karma Jacket, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0014/295/432/cn14295432.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/088/415/cn14088415.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/130/170/cn14130170.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/068/604/cn14068604.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/295/432/cn14295432.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/088/415/cn14088415.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/130/170/cn14130170.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.j
pg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/068/604/cn14068604.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/295/432/cn14295432.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/088/415/cn14088415.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/130/170/cn14130170.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/068/604/cn14068604.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg",
        "name": "Easy Cozy Karma Jacket",
        "price": 118,
        "sku": "158372",
        "url": "http://athleta.gap.com/browse/product.do?pid=158372&cid=1006482",
        "variations": "White Heather|Charcoal Heather|Cassis Heather|Black|White Heather|Charcoal Heather|Cassis Heather|Black|White Heather|Charcoal Heather|Cassis Heather|Black"
    }
}
,{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:35:56.279Z",
        "description": "Velour Hoodie, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0014/120/934/cn14120934.jpg|http://athleta.gap.com/webcontent/0014/121/309/cn14121309.jpg|http://athleta.gap.com/webcontent/0014/449/374/cn14449374.jpg",
        "name": "Velour Hoodie",
        "price": 118,
        "sku": "158403",
        "url": "http://athleta.gap.com/browse/product.do?pid=158403&cid=1006482",
        "variations": "Charcoal Grey Heather"
    }
}
,{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:35:57.948Z",
        "description": "Luxe Stronger Hoodie, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0012/348/901/cn12348901.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0014/422/782/cn14422782.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0012/302/088/cn12302088.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/348/901/cn12348901.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0014/422/782/cn14422782.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0012/348/901/cn12348901.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0014/422/782/cn14422782.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.j
pg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg",
        "name": "Luxe Stronger Hoodie",
        "price": 148,
        "sku": "456789",
        "url": "http://athleta.gap.com/browse/product.do?pid=456789&cid=1006482",
        "variations": "Oatmeal Heather|Black Multi|Black|Oatmeal Heather|Black Multi|Oatmeal Heather|Black Multi"
    }
}
,{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:36:03.291Z",
        "description": "Stronger Long Hoodie, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0014/365/879/cn14365879.jpg|http://athleta.gap.com/webcontent/0014/365/874/cn14365874.jpg|http://athleta.gap.com/webcontent/0014/330/558/cn14330558.jpg|http://athleta.gap.com/webcontent/0014/365/856/cn14365856.jpg|http://athleta.gap.com/webcontent/0014/365/874/cn14365874.jpg|http://athleta.gap.com/webcontent/0014/330/558/cn14330558.jpg",
        "name": "Stronger Long Hoodie",
        "price": 138,
        "sku": "158356",
        "url": "http://athleta.gap.com/browse/product.do?pid=158356&cid=1006482",
        "variations": "Light Grey Multi|Black Multi"
    }
}]
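
The sample above shows that the pipe-joined `images` and `variations` fields can contain repeated values. The following is only a rough post-processing sketch, not part of the scraper itself: it assumes the dataset was downloaded as JSON in the shape shown above, and the helper names `dedupe_joined`, `clean_records`, and `to_csv` are hypothetical, introduced here for illustration.

```python
import csv
import io
import json


def dedupe_joined(value, sep="|"):
    """Drop repeated items from a separator-joined string, keeping first-seen order."""
    seen = set()
    unique = []
    for item in value.split(sep):
        if item and item not in seen:
            seen.add(item)
            unique.append(item)
    return sep.join(unique)


def clean_records(records, fields=("images", "variations")):
    """Return flat product dicts with the given multi-value fields de-duplicated."""
    cleaned = []
    for record in records:
        product = dict(record["product"])  # copy, so the input stays untouched
        for field in fields:
            if field in product:
                product[field] = dedupe_joined(product[field])
        cleaned.append(product)
    return cleaned


def to_csv(products):
    """Serialize flat product dicts to CSV text with a sorted header row."""
    if not products:
        return ""
    header = sorted({key for product in products for key in product})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header)
    writer.writeheader()
    writer.writerows(products)
    return buffer.getvalue()


# A shortened, made-up record in the same shape as the sample dataset.
sample = json.loads(
    '[{"product": {"name": "Stronger Long Hoodie", "price": 138, '
    '"variations": "Light Grey Multi|Black Multi|Light Grey Multi"}}]'
)
cleaned = clean_records(sample)
print(to_csv(cleaned))
```

Keeping first-seen order means the first image and the first color stay in front, which is usually what a catalog import expects. Whether the repeats should be removed at all depends on how the data is used downstream.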
Mikhail Sisin: Co-founder of Diggernaut, a cloud service for data collection and website scraping. He has worked in data collection and analysis, as well as in the development of artificial intelligence and machine learning systems, for more than ten years.