What are the disadvantages of having a left skewed distribution?







I'm currently working on a classification problem, and I have a numerical column that is left-skewed. I've read many posts where people recommend a log transformation or a Box-Cox transformation to fix the left skewness.

What would happen if I left the skewness as it is and continued with my model building? Are there any advantages to fixing skewness for a classification problem (kNN, logistic regression)?







machine-learning python

asked 6 hours ago
user214




















          1 Answer



Some issues depend on the specifics of your data and analytic approach, but in general skewed data (in either direction) degrades your model's ability to describe the more "typical" cases, because the model also has to accommodate much rarer cases that happen to take extreme values.



Since "typical" cases are more common than extreme ones in a skewed data set, you lose some precision on the cases you'll see most often in order to accommodate cases you'll see only rarely. A coefficient estimated from a thousand observations that all lie in [0, 10] is likely to be more precise than one estimated from 990 observations in [0, 10] plus 10 observations in [1,000, 1,000,000]. This can make your model less useful overall.
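The effect of those few extreme observations can be sketched numerically. The snippet below reuses the ranges from the paragraph above and assumes min-max scaling, a common preprocessing step before distance-based classifiers such as kNN. That choice of scaler, the evenly spaced toy values, and all function names are assumptions made for illustration, not part of the original question.

```python
import math

# 990 "typical" values spread evenly over [0, 10] and 10 extreme values
# spread evenly over [1,000, 1,000,000] (toy data, not real observations).
typical = [10 * i / 989 for i in range(990)]
extreme = [1_000 + 999_000 * i / 9 for i in range(10)]
values = typical + extreme


def min_max_scale(xs):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def typical_width(scaled):
    """Share of the [0, 1] range occupied by the first 990 (typical) values."""
    head = scaled[:990]
    return max(head) - min(head)


raw_width = typical_width(min_max_scale(values))
log_width = typical_width(min_max_scale([math.log1p(v) for v in values]))

print(f"typical cases span {raw_width:.6f} of the scaled range (raw)")
print(f"typical cases span {log_width:.3f} of the scaled range (after log1p)")
```

With these numbers the 990 typical cases end up within about 0.00001 of each other on the scaled axis, so the feature can barely tell them apart; after `log1p` they cover roughly 17% of the range. `log1p` is used here only because the toy data contains 0. Whether a log-type transformation suits your variable is a separate question.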


"Fixing" skewness can provide a variety of benefits. It can make analyses that depend on approximately Normally distributed data possible, or at least more informative. It can also produce results that are reported on a sensible scale (this is very situation-dependent), and it can keep extreme values (relative to other predictors) from causing the model to over- or underestimate the influence of the skewed predictor on the predicted classification.



You can test this somewhat (in a non-definitive way, to be sure) by training models on several variants of your data: everything you've got, just as it is; your data without the skewed variable; your data with that variable but excluding values outside the "typical" range (though you'll have to be careful in defining that); your data with the skewed variable transformed or re-scaled; and so on.
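A minimal sketch of that comparison, using only the standard library, might look like the following. Everything in it is an assumption made for illustration: the synthetic data, the "typical" range of 90 to 100, the reflect-then-log transform, and the leave-one-out 1-nearest-neighbour scorer, which merely stands in for whatever model and validation scheme you actually use.

```python
import math
import random


def make_variants(data, col, low, high, transform):
    """Return the four data variants; `data` is a list of (features, label) pairs."""
    def with_value(x, value):
        return x[:col] + [value] + x[col + 1:]

    return {
        "as_is": [(list(x), y) for x, y in data],
        "without_feature": [(x[:col] + x[col + 1:], y) for x, y in data],
        "typical_range_only": [(list(x), y) for x, y in data if low <= x[col] <= high],
        "transformed": [(with_value(x, transform(x[col])), y) for x, y in data],
    }


def loo_accuracy_1nn(data):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (a stand-in model)."""
    hits = 0
    for i, (xi, yi) in enumerate(data):
        _, predicted = min(
            (math.dist(xi, xj), yj) for j, (xj, yj) in enumerate(data) if j != i
        )
        hits += predicted == yi
    return hits / len(data)


# Synthetic, made-up data: feature 0 is left-skewed (most values just below
# 100, with a long tail of smaller values); feature 1 carries the class signal.
rng = random.Random(0)
data = []
for _ in range(60):
    label = rng.randint(0, 1)
    skewed = 100 - rng.expovariate(0.2)
    data.append(([skewed, rng.gauss(label, 1.0)], label))

variants = make_variants(
    data,
    col=0,
    low=90, high=100,                       # assumed "typical" range
    transform=lambda v: math.log(101 - v),  # reflect, then log (one possible choice)
)
scores = {name: loo_accuracy_1nn(rows) for name, rows in variants.items()}
for name, score in scores.items():
    print(f"{name:>20}: n={len(variants[name]):3d}  accuracy={score:.2f}")
```

Differences between the variants on a single small sample like this are only suggestive. On real data you would want a proper held-out set or repeated cross-validation before drawing conclusions.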



As for fixing it, transformations and re-scaling often make sense. But I cannot emphasize this enough:

Fiddling with variables and their distributions should follow from properties of those variables, not from your convenience in modelling.

Log-transforming skewed variables is a prime example of this:



• If you really think that a variable operates on a geometric scale, and you want your model to operate on an arithmetic scale, then a log transformation can make a lot of sense.

• If you think that the variable operates on an arithmetic scale, but you find its distribution inconvenient and think a log transformation would produce a more convenient one, it may make sense to transform. Doing so changes how the model is used and interpreted, usually making it denser and harder to interpret clearly, and that cost may or may not be worthwhile. For example, if you take the log of a numeric outcome and the log of a numeric predictor, the result has to be interpreted as an elasticity between them, which can be awkward to work with and is often not what is desired.

• If you think that a log transformation would be desirable for a variable, but it has a lot of observations with a value of 0, then log transformation isn't really an option for you, whether it would be convenient or not. (Adding a "small value" to the 0 observations causes lots of problems: compare the logs of values from 1 to 10 with the logs of values from 0.0 to 1.0.)
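The problem with zeros in the last point can be seen with a few lines of arithmetic. The sketch below compares the logs of 1 to 10 with where an observation of 0 lands after adding a "small value". The particular constants tried are illustrative assumptions; none is more correct than another, which is exactly the difficulty.

```python
import math

# Logs of the values 1..10 occupy a narrow band: 0.0 up to about 2.30.
logs_1_to_10 = [math.log(v) for v in range(1, 11)]
band = max(logs_1_to_10) - min(logs_1_to_10)

# Candidate "small values" to add to an observation of exactly 0.
for eps in (0.5, 0.1, 1e-3, 1e-6):
    shifted = math.log(0 + eps)
    print(f"eps={eps:g}: log(0 + eps) = {shifted:7.2f}, "
          f"{-shifted / band:.1f} band-widths below log(1)")

# How far below log(1) the zeros land when eps = 1e-6, in units of the band.
gap_for_tiny_eps = -math.log(1e-6) / band
```

The smaller the constant, the further the former zeros are pushed from the rest of the data: with `eps = 1e-6` they sit six times the width of the whole 1-to-10 band below `log(1)`. The arbitrary constant therefore ends up deciding how extreme those observations look to the model.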






answered 5 hours ago
Upper_Case











• Assume I have a numeric column such as price, and it's heavily left-skewed. I'm thinking of using a few basic classification algorithms. What should my approach be? Should I go for a log transformation or a Box-Cox transformation?
  – user214, 5 hours ago

• @user214 Left-skewed price information? That sounds interesting! (My research data is generally skewed hard to the right.) There is always variation between study contexts, but I generally think of money as "geometric enough" that a log transformation is appropriate (or at least strongly defensible). Whether or not that's the ideal transformation is a very difficult question to answer, but log transformation is unlikely to be a problem for you here. You'll just need to remember that anything about that predictor will be reported on a log scale, and interpret accordingly.
  – Upper_Case, 5 hours ago
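Whichever transformation is chosen, it is worth checking on your own data what it does to the skew. A log compresses large values and stretches small ones, so it usually pulls in a long right tail; on a left-skewed column it can make the left tail longer. The sketch below compares sample skewness before and after a plain log and after a reflect-then-log transform. The "prices" are made up, and reflecting around `max + 1` is just one common convention, assumed here for illustration.

```python
import math


def skewness(xs):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Made-up, left-skewed "prices": most values near the top, a few much lower.
prices = [10, 40, 60, 70, 80, 85, 90, 92, 95, 97, 98, 99]

raw = skewness(prices)
logged = skewness([math.log(p) for p in prices])

# Reflect around (max + 1) so the long tail points right, then take the log.
pivot = max(prices) + 1
reflected = skewness([math.log(pivot - p) for p in prices])

print(f"raw: {raw:.2f}, log: {logged:.2f}, reflect-then-log: {reflected:.2f}")
```

On this toy sample the plain log makes the skewness more negative, while reflect-then-log brings it much closer to zero. Note that reflecting reverses the ordering (high prices become low transformed values), which has to be kept in mind when interpreting the model.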















